Commit
PARQUET-442: Nested schema conversion, Thrift struct decoupling, dump-schema utility

Several inter-related things here:

* Added SchemaDescriptor and ColumnDescriptor types to hold computed structure information (e.g. max rep/def levels) about the file schema. These are now used in the FileReader and ColumnReader.
* Added a logical schema node class structure, very similar to the one in parquet-mr (though slimmed down), which can be used for both file reading and writing.
* Added FlatSchemaConverter to convert Parquet flat schema metadata into a nested logical schema.
* Added a SchemaPrinter tool and a parquet-dump-schema CLI tool to visit a nested schema and print it to the console.
* Another big thing here is that, per PARQUET-446 and related work in parquet-mr, it's important for both the public API of this project and internal development to limit our coupling to the compiled Thrift headers. I added `Type`, `Repetition`, and `LogicalType` enums to the `parquet_cpp` namespace and inverted the dependency so that the column readers, scanners, and encoders use these enums.
* A bunch of unit tests.

Author: Wes McKinney <wes@cloudera.com>

Closes apache#38 from wesm/PARQUET-442 and squashes the following commits:

9ca0219 [Wes McKinney] Add a unit test for SchemaPrinter
fdd37cd [Wes McKinney] Comment re: FLBA node ctor
3a15c0c [Wes McKinney] Add some SchemaDescriptor and ColumnDescriptor tests
27e1805 [Wes McKinney] Don't squash supplied CMAKE_CXX_FLAGS
76dd283 [Wes McKinney] Refactor Make* methods as static member functions
2fae8cd [Wes McKinney] Trim some includes
b2e2661 [Wes McKinney] More doc about the parquet_cpp enums
bd78d7c [Wes McKinney] Move metadata enums to parquet/types.h and add rest of parquet:: enums. Add NONE value to Compression
415305b [Wes McKinney] cpplint
4ac84aa [Wes McKinney] Refactor to make PrimitiveNode and GroupNode ctors private. Add MakePrimitive and MakeGroup factory functions. Move parquet::SchemaElement function into static FromParquet ctors so can set private members
3169b24 [Wes McKinney] NewPrimitive should set num_children = 0 always
954658e [Wes McKinney] Add a comment for TestSchemaConverter.InvalidRoot and uncomment tests for root nodes of other repetition types
55d21b0 [Wes McKinney] Remove schema-builder-test.cc
71c1eab [Wes McKinney] Remove crufty builder.h, will revisit
7ef2dee [Wes McKinney] Fix list encoding comment
8c5af4e [Wes McKinney] Remove old comment, unneeded cast
6b041c5 [Wes McKinney] First draft SchemaDescriptor::Init. Refactor to use ColumnDescriptor. Standardize on parquet_cpp enums instead of Thrift metadata structs. Limit #include from Thrift
841ae7f [Wes McKinney] Don't export SchemaPrinter for now
834389a [Wes McKinney] Add Node::Visitor API and implement a simple schema dump CLI tool
a8bf5c8 [Wes McKinney] Catch and throw exception (instead of core dump) if run out of schema children. Add a Node::Visitor abstract API
bde8b18 [Wes McKinney] Can compare FLBA type metadata in logical schemas
f0df0ba [Wes McKinney] Finish a nested schema conversion test
0af0161 [Wes McKinney] Check that root schema node is repeated
5df00aa [Wes McKinney] Expose GroupConverter API, add test for invalid root
beaa99f [Wes McKinney] Refactor slightly and add an FLBA test
6e248b8 [Wes McKinney] Schema tree conversion first cut, add a couple primitive tests
9685c90 [Wes McKinney] Rename Schema -> RootSchema and add another unit test
f7d0487 [Wes McKinney] Schema types test coverage, move more methods into compilation unit
d746352 [Wes McKinney] Better isolate thrift dependency. Move schema/column descriptor into its own header
a8e5a0a [Wes McKinney] Tweaks
fb9d7ad [Wes McKinney] Draft of flat to nested schema conversion. No tests yet
3015063 [Wes McKinney] More prototyping. Rename Type -> Node. PrimitiveNode factory functions
a8a7a01 [Wes McKinney] Start drafting schema types

Change-Id: I484f0a6f02d17d3905f2a40e3b0f17a01554a413
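For readers unfamiliar with the flat-to-nested conversion and the max rep/def levels mentioned above, the idea can be sketched roughly as follows. This is an illustration only, not the code from this commit: `FlatElement`, `Node`, `LeafLevels`, `BuildNode`, `CollectLevels`, and `ComputeLeafLevels` are hypothetical names, and the sketch assumes the flat schema lists elements in depth-first order with a child count per element, that the root does not contribute to the levels, and that any element with zero children is a primitive leaf.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the parquet-cpp types.
enum class Repetition { REQUIRED, OPTIONAL, REPEATED };

// One entry of a flat, depth-first schema listing.
struct FlatElement {
  std::string name;
  Repetition repetition;
  int num_children;  // 0 means a primitive leaf in this sketch
};

// Nested logical schema node.
struct Node {
  std::string name;
  Repetition repetition;
  std::vector<std::unique_ptr<Node>> children;
};

// Computed structure information for one leaf column.
struct LeafLevels {
  std::string path;
  int max_def_level;
  int max_rep_level;
};

// Recursively consume flat elements starting at *pos and build a subtree.
// Throws instead of reading out of bounds when children are missing.
std::unique_ptr<Node> BuildNode(const std::vector<FlatElement>& flat,
                                std::size_t* pos) {
  if (*pos >= flat.size()) {
    throw std::runtime_error("Malformed schema: not enough elements");
  }
  const FlatElement& elem = flat[(*pos)++];
  auto node = std::make_unique<Node>();
  node->name = elem.name;
  node->repetition = elem.repetition;
  for (int i = 0; i < elem.num_children; ++i) {
    node->children.push_back(BuildNode(flat, pos));
  }
  return node;
}

// Walk the tree: every non-required node on the path adds one definition
// level, and every repeated node also adds one repetition level.
void CollectLevels(const Node& node, const std::string& prefix, int def_level,
                   int rep_level, std::vector<LeafLevels>* out) {
  if (node.repetition != Repetition::REQUIRED) ++def_level;
  if (node.repetition == Repetition::REPEATED) ++rep_level;
  const std::string path =
      prefix.empty() ? node.name : prefix + "." + node.name;
  if (node.children.empty()) {
    out->push_back({path, def_level, rep_level});
    return;
  }
  for (const auto& child : node.children) {
    CollectLevels(*child, path, def_level, rep_level, out);
  }
}

// Convert the flat listing to a tree, then compute levels for each leaf.
// The root node itself is skipped when counting levels.
std::vector<LeafLevels> ComputeLeafLevels(const std::vector<FlatElement>& flat) {
  std::size_t pos = 0;
  std::unique_ptr<Node> root = BuildNode(flat, &pos);
  std::vector<LeafLevels> out;
  for (const auto& child : root->children) {
    CollectLevels(*child, "", 0, 0, &out);
  }
  return out;
}
```

Under these assumptions, a flat schema consisting of a root with a required leaf `id` and an optional group `tags` holding a repeated leaf `item` yields two leaves: `id` with max def/rep levels 0/0 and `tags.item` with 2/1. A listing that ends before all declared children have been read raises an exception rather than crashing.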