This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Issues when trying to create a parquet file with FixedSizedListArray #691

Closed
jmdeschenes opened this issue Dec 17, 2021 · 2 comments · Fixed by #941
Labels
bug Something isn't working

Comments

@jmdeschenes

Hello,

I am trying to generate a dataset of complex numbers using a FixedSizeListArray, but I am running into a few issues along the way.

I generate a Parquet file with the following snippet:

use std::fs::File;
use std::sync::Arc;

use arrow2::error::Result;
use arrow2::io::parquet::write::to_parquet_schema;
use arrow2::{
    buffer::MutableBuffer,
    bitmap::MutableBitmap,
    datatypes::DataType,
    array::{Array, FixedSizeListArray, MutablePrimitiveArray, MutableFixedSizeListArray, TryExtend},
    datatypes::{Field, Schema},
    io::parquet::write::{
        write_file, Compression,
        Encoding, Version, WriteOptions, RowGroupIterator
    },
    record_batch::RecordBatch,
};

fn write_batch(path: &str, batch: RecordBatch) -> Result<()> {
    let schema = batch.schema().clone();
    let options = WriteOptions {
        write_statistics: true,
        compression: Compression::Uncompressed,
        version: Version::V2,
    };
    let parquet_schema = to_parquet_schema(&schema)?;

    let iter = vec![Ok(batch)];
    let row_groups = 
        RowGroupIterator::try_new(iter.into_iter(), &schema, options, vec![Encoding::Plain])?;
    // Create a new empty file
    let mut file = File::create(path)?;

    // Write the file. Note that, at present, any error results in a corrupted file.
    let _ = write_file(
        &mut file,
        row_groups,
        &schema,
        parquet_schema,
        options,
        None,
    )?;
    Ok(())
}

fn main() -> Result<()> {
    // If any of these values is `None`, the file can no longer be read from pyarrow
    let data: Vec<Option<Vec<Option<f64>>>> = vec![
        Some(vec![Some(1.0), Some(1.0)]),            
        Some(vec![Some(2.0), Some(2.0)]),
        Some(vec![Some(3.0), Some(3.0)])
    ];
    let buffer = MutableBuffer::new();
    let mut list = MutableFixedSizeListArray::new(
        MutablePrimitiveArray::<f64>::from_data(DataType::Float64,
            buffer,
            Some(MutableBitmap::new())),
            2);
    list.try_extend(data)?;
    let array: FixedSizeListArray = list.into();
    
    let schema = Schema::new(vec![Field::new("a", array.data_type().clone(), true)]);
    let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(array)])?;
    write_batch("test.parquet", batch)?;
    Ok(())
}

I then try to read the file with the following snippet:

import pyarrow.parquet as pq
df = pq.read_table("test.parquet")
print(df)

The problems are as follows:

  1. The resulting table is missing the last row:
pyarrow.Table
a: fixed_size_list<item: double>[2]
  child 0, item: double
----
a: [[[1,1],[2,2]]]
  2. If there is a None in the data vector, the file can no longer be read from pyarrow. (If the None is in the last position, it is simply not read; when the null is in the inner values, it works as intended.)
  File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected all lists to be of size=2 but index 1 had size=1
  3. I would like to get an array with no nullable values, but I don't know how to get it to work. I tried removing the bitmap:
    let mut list = MutableFixedSizeListArray::new(
        MutablePrimitiveArray::<f64>::from_data(DataType::Float64,
            buffer,
            None),
            2);

Changing to this yields an error on the pyarrow side:

  File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected all lists to be of size=2 but index 1 had size=1

Ideally, the data would look like:

    let data: Vec<Vec<f64>> = vec![
        vec![1.0, 1.0],
        vec![2.0, 2.0],
        vec![3.0, 3.0],
    ];
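
Until something like that is accepted directly, one possible workaround is to wrap plain rows into the nested `Option` shape that `try_extend` takes in the snippet above. The following is only a sketch using the standard library; `wrap_rows` is a hypothetical helper name and not part of arrow2.

```rust
/// Hypothetical helper (not an arrow2 API): wraps non-null rows into the
/// `Vec<Option<Vec<Option<f64>>>>` shape used with `try_extend` above.
fn wrap_rows(rows: Vec<Vec<f64>>) -> Vec<Option<Vec<Option<f64>>>> {
    rows.into_iter()
        // Every row is present, and every value inside the row is present.
        .map(|row| Some(row.into_iter().map(Some).collect()))
        .collect()
}

fn main() {
    let data = wrap_rows(vec![vec![1.0, 1.0], vec![2.0, 2.0], vec![3.0, 3.0]]);
    // Same shape as the `data` vector in the original `main`.
    assert_eq!(data.len(), 3);
    assert_eq!(data[0], Some(vec![Some(1.0), Some(1.0)]));
}
```

This only changes how the input is spelled; the resulting array is still built through the nullable path, so it does not by itself remove the validity bitmap.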

Let me know if you need more details.

@jorgecarleitao jorgecarleitao added the bug Something isn't working label Dec 18, 2021
@jorgecarleitao
Owner

Yeah, we do not have an integration test for this type yet, so there is little assurance, unfortunately. I would actually expect the code to bail on this type, but it actually passes all the way through and writes the file :O

@jorgecarleitao
Owner

Hi @jmdeschenes, I found the root cause for the missing item. That is fixed in #941. As far as I can see, the issue about the size and so on is due to the Field being nullable: true but the array not containing any null values?
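
For readers trying to interpret the pyarrow message "Expected all lists to be of size=2", it may help to recall the layout rule from the Arrow columnar format: a fixed-size list with N rows of size k has a child array of exactly N * k values, and a null row still occupies k child slots. The sketch below illustrates that rule with the standard library only. `flatten_fixed_size` is a hypothetical name, and this is not a description of what arrow2 does internally or of what #941 changed.

```rust
/// Hypothetical illustration of the fixed-size list layout (not arrow2 code).
/// Returns the flattened child values and one validity flag per row.
fn flatten_fixed_size(
    rows: &[Option<Vec<Option<f64>>>],
    size: usize,
) -> Result<(Vec<Option<f64>>, Vec<bool>), String> {
    let mut values = Vec::with_capacity(rows.len() * size);
    let mut validity = Vec::with_capacity(rows.len());
    for (i, row) in rows.iter().enumerate() {
        match row {
            Some(items) => {
                // A present row must contain exactly `size` items.
                if items.len() != size {
                    return Err(format!(
                        "row {} has size {}, expected {}",
                        i,
                        items.len(),
                        size
                    ));
                }
                values.extend(items.iter().copied());
                validity.push(true);
            }
            None => {
                // A null row still takes up `size` child slots.
                values.extend(std::iter::repeat(None).take(size));
                validity.push(false);
            }
        }
    }
    Ok((values, validity))
}

fn main() {
    let rows = vec![
        Some(vec![Some(1.0), Some(1.0)]),
        None,
        Some(vec![Some(3.0), Some(3.0)]),
    ];
    let (values, validity) = flatten_fixed_size(&rows, 2).unwrap();
    // Child length is rows * size, even with a null row in the middle.
    assert_eq!(values.len(), rows.len() * 2);
    assert_eq!(validity, vec![true, false, true]);
}
```

A reader that finds fewer than N * k child values for N rows would have to treat some list as shorter than k, which is at least consistent with the error reported above.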
