v3.7.1 corrupt input when reading decimal fields #53

Closed
El-Gor-do opened this issue Apr 23, 2020 · 6 comments

@El-Gor-do

Version: Parquet.Net v3.7.1

Runtime Version: .Net Core 3.1

OS: Windows

Expected behavior

ParquetRowGroupReader.ReadColumn(DataField field) should not throw an exception when reading a decimal field.

Actual behavior

System.IO.IOException: 'corrupt input' is thrown when there are more than 4096 items in the row group. This doesn't occur in v3.7.0.

The code snippet below writes a Parquet file to a MemoryStream and then reads it back. In TestClass I have tested setting Value's data type to bool, double, int, long, DateTimeOffset and string; all of these can be read without errors. Only the decimal data type causes ParquetRowGroupReader.ReadColumn(...) to throw when there are more than 4096 items in a row group.

Steps to reproduce the behavior

  1. Run the console app in the code snippet below.

Code snippet reproducing the behavior

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

using Parquet;
using Parquet.Data;

namespace ConsoleApp1
{
    class Program
    {
        static void Main()
        {
            // v3.7.1 fails if itemCount > 4096
            const int itemCount = 4097;

            // create items
            List<TestClass> items = Enumerable.Range(0, itemCount)
                                        .Select(i => new TestClass()
                                        {
                                            Value = i
                                        })
                                        .ToList();

            using (MemoryStream ms = new MemoryStream())
            {
                // serialise items
                CompressionMethod compressionMethod = CompressionMethod.Snappy;
                const int rowGroupSize = 5000;
                Schema schema = ParquetConvert.Serialize(items, ms, null, compressionMethod, rowGroupSize);
                ms.Position = 0;

                // create reader
                ParquetOptions parquetOptions = null;
                const bool leaveStreamOpen = true;
                ParquetReader reader = new ParquetReader(ms, parquetOptions, leaveStreamOpen);
                
                // get data field
                DataField dataField = reader.Schema
                                        .GetDataFields()
                                        .Single(f => f.Name.Equals(nameof(TestClass.Value)));
                
                // read values
                for (int i = 0; i < reader.RowGroupCount; ++i)
                {
                    using (ParquetRowGroupReader rowGroupReader = reader.OpenRowGroupReader(i))
                    {
                        // v3.7.0 runs correctly
                        // v3.7.1 throws System.IO.IOException: 'corrupt input' when itemCount > 4096
                        DataColumn dc = rowGroupReader.ReadColumn(dataField);

                        foreach (object value in dc.Data)
                        {
                            Console.WriteLine(value);
                        }
                    }
                }
            }
        }
    }

    class TestClass
    {
        public decimal Value { get; set; }
    }
}
@StereoPythonics

I'm seeing a similar issue for integers as well. Thank you for stalling my descent into madness while trying to debug my serialization layer. Reverted to 3.7.0 and everything works.

@El-Gor-do
Author

El-Gor-do commented May 4, 2020

In my code sample above, CompressionMethod.Snappy causes Decimal data types to fail to read when there are more than 4096 items in the Parquet file. If I change compression method to either Gzip or None then rowGroupReader.ReadColumn(dataField) doesn't throw. This suggests that the bug is in the new version of IronSnappy.
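The comparison described above can be sketched as a small console app that round-trips the same data once per compression method and reports which ones can be read back. This is only a sketch built from the calls already used in the repro; the CompressionCheck class and the RoundTrip helper are hypothetical names introduced for illustration, and it assumes CompressionMethod exposes None, Gzip and Snappy as mentioned in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

using Parquet;
using Parquet.Data;

namespace ConsoleApp1
{
    public class CompressionCheck
    {
        static void Main()
        {
            // Same item count that triggers the failure in the original report.
            const int itemCount = 4097;
            const int rowGroupSize = 5000;

            List<TestClass> items = Enumerable.Range(0, itemCount)
                                        .Select(i => new TestClass { Value = i })
                                        .ToList();

            // The three compression methods mentioned in this thread.
            CompressionMethod[] methods =
            {
                CompressionMethod.None,
                CompressionMethod.Gzip,
                CompressionMethod.Snappy
            };

            foreach (CompressionMethod method in methods)
            {
                try
                {
                    int read = RoundTrip(items, method, rowGroupSize);
                    Console.WriteLine($"{method}: read {read} values");
                }
                catch (IOException ex)
                {
                    // This is the exception type reported in the issue.
                    Console.WriteLine($"{method}: failed with '{ex.Message}'");
                }
            }
        }

        // Hypothetical helper: writes the items to a MemoryStream, reads the
        // Value column back and returns the number of values read.
        public static int RoundTrip(List<TestClass> items, CompressionMethod method, int rowGroupSize)
        {
            using (MemoryStream ms = new MemoryStream())
            {
                ParquetConvert.Serialize(items, ms, null, method, rowGroupSize);
                ms.Position = 0;

                ParquetOptions parquetOptions = null;
                const bool leaveStreamOpen = true;
                using (ParquetReader reader = new ParquetReader(ms, parquetOptions, leaveStreamOpen))
                {
                    DataField dataField = reader.Schema
                                            .GetDataFields()
                                            .Single(f => f.Name.Equals(nameof(TestClass.Value)));

                    int count = 0;
                    for (int i = 0; i < reader.RowGroupCount; ++i)
                    {
                        using (ParquetRowGroupReader rowGroupReader = reader.OpenRowGroupReader(i))
                        {
                            count += rowGroupReader.ReadColumn(dataField).Data.Length;
                        }
                    }
                    return count;
                }
            }
        }
    }

    public class TestClass
    {
        public decimal Value { get; set; }
    }
}
```

Going by the behaviour reported in this issue, on v3.7.1 only the Snappy line would be expected to report a failure, while on v3.7.0 all three should read the values back.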

@lmant24

lmant24 commented May 15, 2020

Parquet.net 3.7.1
I have a similar situation. If I use the Gzip compression method, write() succeeds (empty table, single-row and multi-row tables). I have many datetime-type columns.

@skyyearxp
Contributor

The problem is not in reading but in writing. :)

It has already been fixed in IronSnappy v1.2.2:
https://github.com/aloneguid/IronSnappy/releases

However, no new version of Parquet.Net has been released yet:
https://github.com/aloneguid/parquet-dotnet

@aloneguid
Owner

@skyyearxp you are a star

@aloneguid
Owner

Releasing the fix in 3.7.2.
