feat(parquet): Enable Parquet WASM loader #2773

Merged on Nov 4, 2023 (6 commits)

@ibgreen (Collaborator) commented Nov 3, 2023

Making an attempt to reintegrate the WASM parquet loader / writer.

Notes: the WASM loader is amazing but still has limitations:

  • The WASM loader appears to require the entire file to be loaded into an ArrayBuffer, which can be prohibitive for large Parquet files.
  • The JS loader accepts a ReadableFile and does random-access reads on it, so it does not necessarily load all chunks into memory; it loads chunks only as needed.
  • No batched loading option is exposed from Rust.
  • For loading large files into browsers, it is useful to load the schema first and then select a few columns of interest, so that not all columns need to be "materialized" in browser memory.
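
To illustrate the random-access pattern from the second bullet: a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`, so a reader can fetch just the last 8 bytes and then issue one more range read for the metadata. A minimal sketch of that first step; `locateFooter` and `FooterRange` are hypothetical names for illustration and are not part of loaders.gl or parquet-wasm:

```typescript
/** Byte range of the serialized footer metadata within a Parquet file. */
interface FooterRange {
  offset: number; // first byte of the metadata
  length: number; // metadata length in bytes
}

const PARQUET_MAGIC = [0x50, 0x41, 0x52, 0x31]; // ASCII "PAR1"
const TAIL_LENGTH = 8; // 4-byte little-endian footer length + 4-byte magic

/**
 * Given the last 8 bytes of a Parquet file and the total file size,
 * compute where the footer metadata lives. (Hypothetical helper.)
 */
function locateFooter(tail: Uint8Array, fileSize: number): FooterRange {
  if (tail.length !== TAIL_LENGTH) {
    throw new Error(`expected ${TAIL_LENGTH} tail bytes, got ${tail.length}`);
  }
  // The file must end with the magic bytes.
  for (let i = 0; i < PARQUET_MAGIC.length; i++) {
    if (tail[4 + i] !== PARQUET_MAGIC[i]) {
      throw new Error('missing PAR1 magic at end of file');
    }
  }
  // The 4 bytes before the magic hold the footer length (little-endian).
  const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength);
  const length = view.getUint32(0, true);
  const offset = fileSize - TAIL_LENGTH - length;
  // The first 4 bytes of the file are the leading magic, so the footer cannot start before byte 4.
  if (offset < 4) {
    throw new Error('footer length exceeds file size');
  }
  return {offset, length};
}
```

With a 1000-byte file whose tail declares a 100-byte footer, this returns `{offset: 892, length: 100}`, so only 108 bytes need to be read before any column data is touched.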

@ibgreen ibgreen changed the title from "feat(parquet): Enable Parquet WASM writer" to "feat(parquet): Enable Parquet WASM loader" Nov 3, 2023
@ibgreen ibgreen marked this pull request as ready for review November 3, 2023 20:01
@ibgreen ibgreen requested a review from kylebarron November 3, 2023 21:01
@kylebarron (Collaborator) commented

> The WASM loader appears to require the entire file to be loaded into an ArrayBuffer, which can be prohibitive for large Parquet files.

Either that or you need to make the requests from inside WASM. See https://kylebarron.dev/parquet-wasm/functions/esm_arrow2.readRowGroupAsync.html

As a development note, there are two bindings, arrow-rs and arrow2 (see docs note here). In recent Rust ecosystem developments, arrow2 is "dying" (the main contributor stepped back and the other contributor forked it inside of polars), so I'm moving my work to arrow-rs in general from now on. The arrow-rs (i.e. arrow1) bindings in parquet-wasm don't have async read support yet, but it will be added at some point.

There's also https://kylebarron.dev/parquet-wasm/functions/esm_arrow2.readParquetStream.html and https://kylebarron.dev/parquet-wasm/functions/esm_arrow1.readParquetStream.html, added recently. They aren't documented well, though there's an example here: https://observablehq.com/d/f5723cea6661fb71. The main downside is that it looks like each column chunk is requested independently, and I'm not immediately sure how to get the underlying Rust APIs to batch the requests, e.g. for each chunk.
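
For illustration, batching here could mean coalescing nearby byte ranges into fewer, larger range requests before they reach the network. A minimal sketch of that idea; `ByteRange`, `coalesceRanges` and `maxGap` are hypothetical names, and this is not how parquet-wasm or the underlying Rust crates currently behave:

```typescript
/** Half-open byte range [start, end) of one column chunk. (Hypothetical type.) */
interface ByteRange {
  start: number;
  end: number;
}

/**
 * Merge ranges that overlap or are separated by at most `maxGap` bytes,
 * so that several column chunks can be fetched with a single request.
 * The input array and its elements are not modified.
 */
function coalesceRanges(ranges: ByteRange[], maxGap: number): ByteRange[] {
  const sorted = [...ranges].sort((a, b) => a.start - b.start);
  const merged: ByteRange[] = [];
  for (const range of sorted) {
    const last = merged[merged.length - 1];
    if (last !== undefined && range.start - last.end <= maxGap) {
      // Close enough to the previous range: extend it instead of adding a new request.
      last.end = Math.max(last.end, range.end);
    } else {
      merged.push({...range});
    }
  }
  return merged;
}
```

For example, chunks at `[0, 100)`, `[100, 250)` and `[1000, 1200)` with `maxGap = 16` collapse into two requests, `[0, 250)` and `[1000, 1200)`. The trade-off is that a larger `maxGap` saves round trips at the cost of downloading some unused bytes.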

> For loading large files into browsers, it is useful to load the schema first and then select a few columns of interest, so that not all columns need to be "materialized" in browser memory.

You can do that with readMetadataAsync. It exposes the parquet metadata and the arrow schema from the result of that call.

@ibgreen (Collaborator, Author) commented Nov 4, 2023

> You can do that with readMetadataAsync. It exposes the parquet metadata and the arrow schema from the result of that call.

Nice. I would also need to be able to apply a column filter to the reader so it doesn't read the columns I didn't select.
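
A rough sketch of the schema-first flow: once the field names are known from the metadata, the requested column names can be resolved to indices before any column data is read. `resolveColumns` is a hypothetical helper; how the indices would be passed to the reader is left open, since no column-filter option is confirmed in this thread:

```typescript
/**
 * Resolve the columns a user asked for against the field names found in
 * the schema, returning their indices in the order requested.
 * Throws on a name that is not in the schema. (Hypothetical helper.)
 */
function resolveColumns(schemaFields: string[], wanted: string[]): number[] {
  const indexByName = new Map<string, number>();
  schemaFields.forEach((name, index) => indexByName.set(name, index));
  return wanted.map((name) => {
    const index = indexByName.get(name);
    if (index === undefined) {
      throw new Error(`unknown column: ${name}`);
    }
    return index;
  });
}
```

For a schema with fields `['id', 'name', 'geometry', 'population']`, asking for `['geometry', 'id']` yields `[2, 0]`, and a misspelled column fails early instead of after a large download.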

@ibgreen (Collaborator, Author) commented Nov 4, 2023

So the big problem in making this run in the browser seems to be that the generated arrow1.js file in the parquet-wasm module contains a couple of unneeded Node-specific imports. I've been playing with stripping those out post-install, but I think it would be best if parquet-wasm did that. If the compiler has no option to emit browser-compatible output, let's just use e.g. sed or perl to remove them?

The annoying util import at line 3 can be handled by:

sed -i '' 's/require(`util`)/globalThis/g' node_modules/parquet-wasm/node/arrow1.js

And something similar is needed for the fs import at the end.
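
Note that `sed -i ''` is the BSD/macOS form; GNU sed takes `-i` without the empty argument, so the command above is not portable as written. A post-install script could do the same replacement in a platform-independent way. A minimal sketch of the string transformation only; `replaceUtilRequire` is a hypothetical helper, and the sample input line in the note below is an assumption about what the generated file contains:

```typescript
/**
 * Replace every occurrence of require(`util`) with globalThis, mirroring
 * the sed substitution above. Reading and writing the file is left to
 * the caller. (Hypothetical helper.)
 */
function replaceUtilRequire(source: string): string {
  // split/join replaces all occurrences without needing regex escaping.
  return source.split('require(`util`)').join('globalThis');
}
```

Applied to a line such as ``const { TextDecoder } = require(`util`);`` this produces `const { TextDecoder } = globalThis;`, which resolves in browsers because `TextDecoder` is a global there.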

@kylebarron (Collaborator) commented Nov 4, 2023

> node_modules/parquet-wasm/node/arrow1.js

You're using the node bundle. You should be using either the esm or the bundler entry point. See https://github.com/kylebarron/parquet-wasm#choice-of-bundles and https://rustwasm.github.io/docs/wasm-pack/commands/build.html#target (what they call web, I call esm).

What we did previously in loaders was try to conditionally use the node bundle when it was called from Node, and use one of the others on the web. The node bundle is nice in Node because it's a synchronous import.
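
The conditional selection described above could be sketched as follows. `detectNode` and `chooseBundle` are hypothetical names, and mapping the returned kind to an actual module specifier is left to the caller, since the exact entry-point paths depend on the parquet-wasm version:

```typescript
type BundleKind = 'node' | 'esm';

/** Best-effort check for a Node.js runtime, without requiring Node typings. */
function detectNode(): boolean {
  const proc = (globalThis as unknown as {process?: {versions?: {node?: string}}}).process;
  return typeof proc?.versions?.node === 'string';
}

/**
 * Pick which parquet-wasm bundle to load: the synchronous node bundle
 * under Node.js, the esm bundle everywhere else.
 * The flag is injectable so the choice can be tested or overridden.
 */
function chooseBundle(isNode: boolean = detectNode()): BundleKind {
  return isNode ? 'node' : 'esm';
}
```

Keeping the decision in one small function means the rest of the loader does not need to know which runtime it is in; only the import step differs.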

@ibgreen ibgreen merged commit de10c5d into master Nov 4, 2023
4 checks passed
@ibgreen ibgreen deleted the ib/geoparquet-wasm branch November 4, 2023 13:27