
Get schema from IPC (Feather v2) file without loading it first. #1114

Closed · ghuls opened this issue Aug 9, 2021 · 20 comments

@ghuls (Collaborator) commented Aug 9, 2021

Describe your feature request

Get schema from IPC (Feather v2) file without loading it first.

Similar to what is possible with pyarrow dataset API:

ds = pyarrow.dataset.dataset(feather_file, format="feather")

# Get all column names
ds.schema.names

Getting all column names before loading the data is useful when you have a lot of columns (with unknown names) from which you want to select only a subset.

import pyarrow as pa
import pyarrow.feather as pf
import pyarrow.dataset as ds
import polars as pl

feather_file = 'test.feather'

In [9]: %time feather_v2_dataset = ds.dataset(feather_file, format="feather")
CPU times: user 8.74 s, sys: 1.19 s, total: 9.93 s
Wall time: 9.97 s

In [10]: %time column_names = feather_v2_dataset.schema.names
CPU times: user 485 ms, sys: 248 ms, total: 733 ms
Wall time: 733 ms

# Select 10 columns (2 million columns in total).
selected_column_names = column_names[0:100000:10000]

In [12]: len(selected_column_names)
Out[12]: 10

# Use pyarrow memory mapping to load only the columns of interest.
In [16]: %time pa_table = pf.read_table(feather_file, columns=selected_column_names, memory_map=True)
CPU times: user 6.85 s, sys: 32.5 s, total: 39.3 s
Wall time: 48 s

# Second time loading the data is faster (as it is still in cache). 
In [17]: %time pa_table = pf.read_table(feather_file, columns=selected_column_names, memory_map=True)
CPU times: user 6.17 s, sys: 1.75 s, total: 7.92 s
Wall time: 10.8 s

# Convert to polars dataframe.
%time pl_df = pl.DataFrame(pa_table)
CPU times: user 1.18 ms, sys: 183 µs, total: 1.37 ms
Wall time: 1.26 ms

# Without memory mapping it takes ages:
# Pyarrow Dataset API does not seem to use memory mapping at the moment and has similar timing

In [21]: %time pa_table = pf.read_table(feather_file, columns=selected_column_names, memory_map=False)
CPU times: user 9.38 s, sys: 4min 57s, total: 5min 7s
Wall time: 20min 47s

In [50]: %time pa_table_from_ds = feather_v2_dataset.to_table(columns=selected_column_names)
CPU times: user 7.36 s, sys: 5min 23s, total: 5min 30s
Wall time: 24min 13s
@ghuls (Collaborator, Author) commented Aug 9, 2021

Once getting columns from an IPC file is exposed, the following arrow2 issues (once implemented) should help with faster loading of selected columns:
jorgecarleitao/arrow2#261
jorgecarleitao/arrow2#264

@jorgecarleitao (Collaborator)

I took a look at arrow2's code, and it seems that this is already possible: read_file_metadata returns a FileMetadata, whose schema is available via FileMetadata::schema().
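
For readers following along, a minimal sketch of that path, assuming arrow2's API as referenced in this thread (the file name is illustrative):

use std::fs::File;
use arrow2::io::ipc::read::read_file_metadata;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the IPC (Feather v2) file; read_file_metadata only parses the
    // footer metadata, so no record batch data is loaded.
    let mut file = File::open("test.feather")?;
    let metadata = read_file_metadata(&mut file)?;

    // The schema (column names and types) is available without reading data.
    for field in metadata.schema().fields() {
        println!("{}: {:?}", field.name(), field.data_type());
    }
    Ok(())
}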

@ritchie46 (Member)

I first have to expose a schema object to the Python API.

@jorgecarleitao, I don't know much about IPC. Is it also possible to determine the size of a record batch, say, only read n rows?

@jorgecarleitao (Collaborator)

It is possible, but not yet implemented ^_^

@ghuls (Collaborator, Author) commented Sep 13, 2021

@ritchie46 Is the IPC schema already exposed to the polars python API?

@ritchie46 (Member)

@ghuls, no, not yet. I also don't think arrow supports this at the moment.

@ghuls (Collaborator, Author) commented Sep 13, 2021

@ritchie46 (Member)

Check!

@ritchie46 (Member)

A bit of a late realization, but as this is supported in pyarrow, I don't think we should compile this behavior in; we can just create a Python function that returns the schema. For the schema I think we should use a Dict[str, Polars DataType].

@ghuls (Collaborator, Author) commented Sep 14, 2021

Sounds good for now.

This still takes 10 seconds for a feather file with 1 million columns.

ds = pyarrow.dataset.dataset(feather_file, format="feather")

# Get schema.
ds.schema

Reading actual data takes ages when not using memory mapping with pyarrow.

So support for projection pushdown when reading from an IPC file (as added in jorgecarleitao/arrow2#264) would be very welcome.

@ritchie46 (Member)

I didn't realize this reads the data first. In that case we'd better look at this snippet:

#1114 (comment)

This could be done in the Python bindings.

@ghuls (Collaborator, Author) commented Sep 16, 2021

@ritchie46
Can the projection option of arrow::io::ipc::read::FileReader::new
https://github.com/jorgecarleitao/arrow2/blob/b77de38cc1cf804ba00e63daf7c0e9f26bb2fa5b/src/io/ipc/read/reader.rs#L234-L238
be exposed in polars (now None is passed unconditionally):

let ipc_reader = read::FileReader::new(&mut self.reader, metadata, None);

@ritchie46 (Member)

> @ritchie46
> Can the projection option of arrow::io::ipc::read::FileReader::new
> https://github.com/jorgecarleitao/arrow2/blob/b77de38cc1cf804ba00e63daf7c0e9f26bb2fa5b/src/io/ipc/read/reader.rs#L234-L238
> be exposed in polars (now None is passed unconditionally):
>
> let ipc_reader = read::FileReader::new(&mut self.reader, metadata, None);

Yes, this must be added now. Also a scan_ipc.
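
A minimal sketch of what that could look like on the polars side, assuming the projection is a list of column indices as added in jorgecarleitao/arrow2#264; the columns variable and the name-to-index lookup here are illustrative, not the actual polars implementation:

// Hypothetical: resolve user-supplied column names (columns: Option<Vec<String>>)
// to indices in the IPC schema and pass them to the reader instead of None.
let projection: Option<Vec<usize>> = columns.map(|names| {
    names
        .iter()
        .map(|name| {
            metadata
                .schema()
                .fields()
                .iter()
                .position(|field| field.name() == name)
                .expect("column not found in IPC schema")
        })
        .collect()
});
let ipc_reader = read::FileReader::new(&mut self.reader, metadata, projection);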

@ghuls (Collaborator, Author) commented Sep 17, 2021

Tested the new pl.read_ipc_schema function.

In [13]: feather_file = 'test.rankings.v2.feather'

# Get schema from Feather file with polars.
In [14]: %time feather_schema_polars = pl.read_ipc_schema(feather_file)
CPU times: user 2.36 s, sys: 405 ms, total: 2.77 s
Wall time: 2.76 s


# Get schema from Feather file with pyarrow (pyarrow might do some more stuff here).
In [15]: %time feather_dataset = ds.dataset(feather_file, format='feather')
CPU times: user 6.87 s, sys: 324 ms, total: 7.19 s
Wall time: 7.19 s

In [16]: feather_schema_pyarrow = feather_dataset.schema

# Number of columns
In [47]: len(feather_schema_polars)
Out[47]: 2208374

# Compare column names from schema extracted by polars and pyarrow.
In [48]: list(feather_schema_polars.keys()) == feather_schema_pyarrow.names
Out[48]: True

@ritchie46 (Member)

This is possible now.

@ghuls (Collaborator, Author) commented Oct 20, 2021

> @ritchie46
> Can the projection option of arrow::io::ipc::read::FileReader::new
> https://github.com/jorgecarleitao/arrow2/blob/b77de38cc1cf804ba00e63daf7c0e9f26bb2fa5b/src/io/ipc/read/reader.rs#L234-L238
> be exposed in polars (now None is passed unconditionally):
>
> let ipc_reader = read::FileReader::new(&mut self.reader, metadata, None);
>
> Yes, this must be added now. Also a scan_ipc.

Are you planning to support projection directly (instead of passing None unconditionally)?

@ritchie46 (Member)

Oh, yes. Maybe we could make an issue for that and label it good_first_issue. It's a copy of the parquet logic.

@ghuls (Collaborator, Author) commented Oct 20, 2021

Are you sure parquet supports it already in non-lazy mode?

@ritchie46 (Member)

> Are you sure parquet supports it already in non-lazy mode?

Not entirely. :P

@ghuls (Collaborator, Author) commented Oct 20, 2021

Created: #1569 which adds lazy projection for IPC and eager column reading for IPC and parquet.
