[BUG]: ParquetRowGroupReader.ReadColumnAsync returning wrong values for Int32 columns #389

Closed
El-Gor-do opened this issue Aug 18, 2023 · 5 comments

Comments

@El-Gor-do

El-Gor-do commented Aug 18, 2023

Library Version

4.16.0

OS

Windows

OS Architecture

64 bit

How to reproduce?

TestParquet.csproj

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net7.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Parquet.Net" Version="4.15.0" />
  </ItemGroup>
</Project>

Program.cs

using Parquet;
using Parquet.Data;
using Parquet.Schema;
using Parquet.Serialization;
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Threading.Tasks;

namespace TestParquet
{
    internal class Program
    {
        public class TestClass
        {
            public int Value { get; set; }

            private static Random r { get; } = new Random(0);

            public TestClass()
            {
                this.Value = TestClass.r.Next(int.MinValue, int.MaxValue);
            }
        }

        static async Task Main(string[] args)
        {
            // create items
            int itemCount = 4;
            List<TestClass> items = Enumerable.Range(0, itemCount).Select(i => new TestClass()).ToList();

            List<int> actualValues = new List<int>();
            using (MemoryStream ms = new MemoryStream())
            {
                // create parquet stream from items
                ParquetSerializerOptions options = new ParquetSerializerOptions()
                {
                    Append = false,
                    CompressionLevel = CompressionLevel.SmallestSize,
                    CompressionMethod = CompressionMethod.Gzip,
                };
                ParquetSchema schema = await ParquetSerializer.SerializeAsync(items, ms, options);
                ms.Position = 0;

                // read values in parquet stream
                DataField field = schema.DataFields[0];
                ParquetReader reader = await ParquetReader.CreateAsync(ms, leaveStreamOpen: true);
                for (int rowGroupIndex = 0; rowGroupIndex < reader.RowGroupCount; ++rowGroupIndex)
                {
                    using (ParquetRowGroupReader rowGroupReader = reader.OpenRowGroupReader(rowGroupIndex))
                    {
                        // if itemCount > 4096 then this throws InvalidOperationException: 'don't know how to skip'
                        DataColumn dc = await rowGroupReader.ReadColumnAsync(field);

                        actualValues.AddRange(dc.Data.Cast<int>());
                    }
                }
            }

            // check for differences between expected and actual values
            for (int i = 0; i < items.Count; ++i)
            {
                int expectedValue = items[i].Value;
                int actualValue = actualValues[i];
                if (expectedValue != actualValue)
                    Console.WriteLine($"i {i} expected {expectedValue}, actual {actualValue}");
            }
        }
    }
}

Failing test

When running the TestParquet console app using Parquet.Net 4.15.0, nothing is printed, which indicates that all values were read correctly from the stream.

Change the Parquet.Net package to 4.16.0 and run the app again. It prints i 3 expected -1945678310, actual -737718758, indicating that the 4th value returned for the data column is incorrect.

Also in 4.16.0, if you change itemCount to any value > 4096, then DataColumn dc = await rowGroupReader.ReadColumnAsync(field); throws InvalidOperationException: 'don't know how to skip'.

@El-Gor-do
Author

TestClass.Value is specifically set using Random.Next(...) to trigger the incorrect values returned by rowGroupReader.ReadColumnAsync(...). If I simply set TestClass.Value to 0, 1, 2, ... then rowGroupReader.ReadColumnAsync(...) returns the correct values.

@El-Gor-do
Author

Using Parquet.Net 4.16.1, the test app now correctly shows no output when itemCount <= 4096 but still throws InvalidOperationException: 'don't know how to skip' when itemCount > 4096.

@aloneguid
Owner

Thanks, still working on this one, adding more test coverage.

@ee-naveen
Contributor

ee-naveen commented Aug 21, 2023

This error occurs when the array size = (blocksize * i) + 1, where i > 0,
e.g. 1025, 2049, 3073, ...

The first value must be added to the destination before reading the block.
Currently, the last element is not added to the destination because there are no bytes left to read.

DecodeInt function (screenshot)
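
To make the off-by-one above easier to follow, here is a minimal sketch of the pattern. It is not the library's DecodeInt. It assumes a simplified model: one first value, followed by plain int deltas zero-padded to whole blocks, with no bit-packing or headers. Encode, DecodeBuggy, DecodeFixed and the block size of 4 are all hypothetical names and values chosen for illustration. It uses top-level statements, so it needs .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical encoder: the first value is stored on its own; the remaining values
// become deltas, zero-padded so the delta array is a whole number of blocks.
// Expects a non-empty input.
(int First, int[] Deltas) Encode(int[] values, int blockSize)
{
    int deltaCount = values.Length - 1;
    int paddedCount = (deltaCount + blockSize - 1) / blockSize * blockSize;
    int[] deltas = new int[paddedCount];
    for (int i = 0; i < deltaCount; i++)
        deltas[i] = values[i + 1] - values[i];
    return (values[0], deltas);
}

// Buggy pattern: a value is only written when a delta follows it.
// With blockSize * i + 1 values there is no padding, so the last value is never written.
List<int> DecodeBuggy(int first, int[] deltas, int valueCount, int blockSize)
{
    var dest = new List<int>();
    int current = first;
    for (int pos = 0; pos < deltas.Length; pos += blockSize) // stands in for "while bytes remain"
    {
        for (int i = pos; i < pos + blockSize && dest.Count < valueCount; i++)
        {
            dest.Add(current);
            current += deltas[i];
        }
    }
    return dest;
}

// Fixed pattern: the first value is added to the destination before any block is read.
List<int> DecodeFixed(int first, int[] deltas, int valueCount, int blockSize)
{
    var dest = new List<int>();
    if (valueCount == 0) return dest;
    int current = first;
    dest.Add(current);
    for (int pos = 0; pos < deltas.Length && dest.Count < valueCount; pos += blockSize)
    {
        for (int i = pos; i < pos + blockSize && dest.Count < valueCount; i++)
        {
            current += deltas[i];
            dest.Add(current);
        }
    }
    return dest;
}

const int demoBlockSize = 4;
foreach (int count in new[] { 4, 5, 6, 9 })
{
    int[] input = Enumerable.Range(0, count).Select(n => n * n).ToArray();
    var (first, deltas) = Encode(input, demoBlockSize);
    int buggyCount = DecodeBuggy(first, deltas, count, demoBlockSize).Count;
    int fixedCount = DecodeFixed(first, deltas, count, demoBlockSize).Count;
    Console.WriteLine($"count {count}: buggy decoded {buggyCount}, fixed decoded {fixedCount}");
}
```

In this model only the sizes 5 and 9 (blockSize * i + 1) lose their last element in the buggy version. For every other size a padding delta happens to flush the last value, which would be consistent with the failure showing up only at sizes like 1025, 2049, 3073.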

@El-Gor-do
Author

I have verified that v4.16.2 fixes InvalidOperationException: 'don't know how to skip'
