You could perhaps buffer up a DataFusion target batch size worth of structs, e.g. ~8000, and then convert this ArrayOfStruct to a StructOfArray. You could possibly make use of https://docs.rs/arrow/46.0.0/arrow/#serde-compatibility to do this, or hand-roll your builders as shown at https://docs.rs/arrow-array/46.0.0/arrow_array/builder/index.html#nested-usage. If you're only interested in converting to arrow, you shouldn't need anything as heavy as a query engine like DuckDB or DataFusion.
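To make the ArrayOfStruct to StructOfArray step concrete, here is a minimal sketch using only the standard library. `Reading`, `Point`, `ReadingColumns`, and `to_columns` are hypothetical names invented for illustration, and the plain `Vec` columns are an assumption standing in for Arrow builders; in real code each column would presumably be a builder from `arrow_array::builder` (e.g. `UInt64Builder`, `StringBuilder`, `Float64Builder` under a `StructBuilder`).

```rust
// Hypothetical row types standing in for a (much larger) BigStruct.
#[derive(Debug, Clone, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

#[derive(Debug, Clone, PartialEq)]
struct Reading {
    id: u64,
    label: Option<String>,
    pos: Point,
}

/// Struct-of-arrays form: one column per leaf field.
/// Plain Vecs are an assumption standing in for Arrow builders.
#[derive(Debug, Default)]
struct ReadingColumns {
    id: Vec<u64>,
    label: Vec<Option<String>>, // None plays the role of an Arrow null
    pos_x: Vec<f64>,            // nested struct flattened to one column per field
    pos_y: Vec<f64>,
}

impl ReadingColumns {
    fn with_capacity(n: usize) -> Self {
        ReadingColumns {
            id: Vec::with_capacity(n),
            label: Vec::with_capacity(n),
            pos_x: Vec::with_capacity(n),
            pos_y: Vec::with_capacity(n),
        }
    }

    /// Appends one row, pushing exactly one value onto every column
    /// so that all columns keep the same length.
    fn push(&mut self, r: Reading) {
        self.id.push(r.id);
        self.label.push(r.label);
        self.pos_x.push(r.pos.x);
        self.pos_y.push(r.pos.y);
    }

    fn len(&self) -> usize {
        self.id.len()
    }
}

/// Transposes one buffered batch of rows into columns.
fn to_columns(rows: Vec<Reading>) -> ReadingColumns {
    let mut cols = ReadingColumns::with_capacity(rows.len());
    for r in rows {
        cols.push(r);
    }
    cols
}
```

The key invariant is that every `push` touches every column once, which is the same discipline nested Arrow builders require.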
I'm not sure there is a way around this; there is some schema inference logic, but it is very limited in scope.
Enums are quite tricky to map to arrow; there is support for UnionArrays, but support for these is extremely limited across the ecosystem.
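One possible workaround that sidesteps UnionArrays, sketched here as an assumption rather than something this thread recommends, is to flatten an enum into a tag column plus one nullable column per variant field. `Event` and `EventColumns` are hypothetical names, and plain `Vec<Option<_>>` columns again stand in for nullable Arrow arrays.

```rust
// Hypothetical enum standing in for a variant-heavy field of BigStruct.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Click { x: i32, y: i32 },
    Key(String),
    Close,
}

/// Flattened representation: a discriminant column plus one nullable
/// column per variant field. Fields of inactive variants are None.
#[derive(Debug, Default)]
struct EventColumns {
    tag: Vec<&'static str>,
    click_x: Vec<Option<i32>>,
    click_y: Vec<Option<i32>>,
    key: Vec<Option<String>>,
}

impl EventColumns {
    /// Appends one enum value as a single row across all columns.
    fn push(&mut self, e: Event) {
        let (tag, x, y, key) = match e {
            Event::Click { x, y } => ("click", Some(x), Some(y), None),
            Event::Key(k) => ("key", None, None, Some(k)),
            Event::Close => ("close", None, None, None),
        };
        // Every column receives exactly one value so lengths stay aligned.
        self.tag.push(tag);
        self.click_x.push(x);
        self.click_y.push(y);
        self.key.push(key);
    }
}
```

The trade-off is that the column count grows with the number of variant fields, but the result uses only plain nullable columns, which are widely supported.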
In DuckDB, when I have an ndjson file, I can use
CREATE TABLE t AS SELECT * FROM read_ndjson('file.ndjson');
I would like to avoid the interim ndjson step if I can, as the objective is to go from deeply nested Rust structs/enums to the Arrow world, as transparently as possible.
Currently, I'm serializing BigStruct to an ndjson String, one at a time, and then writing out the ndjson file (8x the size of the source file). Then, using DuckDB's SQL above, I'm able to automagically get the data types back from plaintext (JSON).
It is very desirable to go directly from Rust data types to inserting the native struct data type of DataFusion/DuckDB, etc.
Note that I don't have a Vec<BigStruct>; BigStructs are being produced in an async Stream.
I understand this could be an ArrayOfStruct to StructOfArray 'problem', but I don't have an ArrayOfStruct to begin with, as they are produced in a streaming fashion (too many to keep it all in memory).
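A stream does not have to be collected in full for this to work: it can be cut into bounded batches, so only one batch of structs is in memory at a time. The sketch below illustrates the idea with a synchronous `Iterator` standing in for the async Stream; `Batches` and `batches` are hypothetical names. For a real async Stream, the `futures` crate's `StreamExt` should offer a comparable `chunks` adapter, though that is stated from memory and worth checking against its docs.

```rust
/// Groups items from any iterator into batches of at most `batch_size`,
/// holding only one batch in memory at a time.
struct Batches<I: Iterator> {
    inner: I,
    batch_size: usize,
}

impl<I: Iterator> Iterator for Batches<I> {
    type Item = Vec<I::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut batch = Vec::with_capacity(self.batch_size);
        // Pull up to `batch_size` items without consuming the source itself.
        for item in self.inner.by_ref().take(self.batch_size) {
            batch.push(item);
        }
        // An empty batch means the source is exhausted.
        if batch.is_empty() {
            None
        } else {
            Some(batch)
        }
    }
}

/// Wraps `inner` so that it yields Vec batches instead of single items.
fn batches<I: Iterator>(inner: I, batch_size: usize) -> Batches<I> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    Batches { inner, batch_size }
}
```

Each yielded `Vec` is a small ArrayOfStruct that can be transposed into columns and then dropped before the next batch is read; the final batch may be shorter than `batch_size`.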
In addition to this example of writing JSON into DuckDB not working (it just writes the hex bytes in decimal), I lose all type information (read_ndjson via the CLI recreates all of it though); native support for Rust data types is a work in progress.
Do you think this is something that is good to have for DataFusion, and if so, is it something in the works already?
Are there any examples I can look at?
Oh, and inferred schema would be best. The BigStructs are quite big, and conceal a whole lot of variations. It would be a nightmare to write the schema for all of them.
Thanks in advance.