diff --git a/README.md b/README.md index 270c015a27..d6c1713fec 100644 --- a/README.md +++ b/README.md @@ -20,7 +20,7 @@ Parquet MR [![Build Status](https://travis-ci.org/apache/incubator-parquet-mr.svg?branch=master)](http://travis-ci.org/apache/incubator-parquet-mr) ====== -Parquet-MR contains the java implementation of the [Parquet format](https://github.com/Parquet/parquet-format). +Parquet-MR contains the java implementation of the [Parquet format](https://github.com/apache/parquet-format). Parquet is a columnar storage format for Hadoop; it provides efficient storage and encoding of data. Parquet uses the [record shredding and assembly algorithm](https://github.com/Parquet/parquet-mr/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper) described in the Dremel paper to represent nested structures. @@ -95,27 +95,27 @@ Parquet is a very active project, and new features are being added quickly; belo ## Map/Reduce integration -[Input](https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java) and [Output](https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetOutputFormat.java) formats. +[Input](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java) and [Output](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java) formats. Note that to use an Input or Output format, you need to implement a WriteSupport or ReadSupport class, which will implement the conversion of your object to and from a Parquet schema. We've implemented this for 2 popular data formats to provide a clean migration path as well: ### Thrift -Thrift integration is provided by the [parquet-thrift](https://github.com/Parquet/parquet-mr/tree/master/parquet-thrift) sub-project. If you are using Thrift through Scala, you may be using Twitter's [Scrooge](https://github.com/twitter/scrooge). If that's the case, not to worry -- we took care of the Scrooge/Apache Thrift glue for you in the [parquet-scrooge](https://github.com/Parquet/parquet-mr/tree/master/parquet-scrooge) sub-project. +Thrift integration is provided by the [parquet-thrift](https://github.com/apache/parquet-mr/tree/master/parquet-thrift) sub-project. If you are using Thrift through Scala, you may be using Twitter's [Scrooge](https://github.com/twitter/scrooge). If that's the case, not to worry -- we took care of the Scrooge/Apache Thrift glue for you in the [parquet-scrooge](https://github.com/apache/parquet-mr/tree/master/parquet-scrooge) sub-project. ### Avro -Avro conversion is implemented via the [parquet-avro](https://github.com/Parquet/parquet-mr/tree/master/parquet-avro) sub-project. +Avro conversion is implemented via the [parquet-avro](https://github.com/apache/parquet-mr/tree/master/parquet-avro) sub-project. ### Create your own objects * The ParquetOutputFormat can be provided a WriteSupport to write your own objects to an event based RecordConsumer. * the ParquetInputFormat can be provided a ReadSupport to materialize your own objects by implementing a RecordMaterializer See the APIs: -* [Record conversion API](https://github.com/Parquet/parquet-mr/tree/master/parquet-column/src/main/java/parquet/io/api) -* [Hadoop API](https://github.com/Parquet/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/api) +* [Record conversion API](https://github.com/apache/parquet-mr/tree/master/parquet-column/src/main/java/parquet/io/api) +* [Hadoop API](https://github.com/apache/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/api) ## Apache Pig integration -A [Loader](https://github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/main/java/parquet/pig/ParquetLoader.java) and a [Storer](https://github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/main/java/parquet/pig/ParquetStorer.java) are provided to read and write Parquet files with Apache Pig +A [Loader](https://github.com/apache/parquet-mr/blob/master/parquet-pig/src/main/java/org/apache/parquet/pig/ParquetLoader.java) and a [Storer](https://github.com/apache/parquet-mr/blob/master/parquet-pig/src/main/java/org/apache/parquet/pig/ParquetStorer.java) are provided to read and write Parquet files with Apache Pig Storing data into Parquet in Pig is simple: ``` @@ -134,7 +134,7 @@ If the data was stored using Pig, things will "just work". If the data was store ## Hive integration -Hive integration is provided via the [parquet-hive](https://github.com/Parquet/parquet-mr/tree/master/parquet-hive) sub-project. +Hive integration is provided via the [parquet-hive](https://github.com/apache/parquet-mr/tree/master/parquet-hive) sub-project. ## Build @@ -222,7 +222,7 @@ The build runs in [Travis CI](http://travis-ci.org/Parquet/parquet-mr): ### How To Contribute -If you are looking for some ideas on what to contribute, check out GitHub issues for this project labeled ["Pick me up!"](https://github.com/Parquet/parquet-mr/issues?labels=pick+me+up%21&state=open). +If you are looking for some ideas on what to contribute, check out GitHub issues for this project labeled ["Pick me up!"](https://github.com/apache/parquet-mr/issues?labels=pick+me+up%21&state=open). Comment on the issue and/or contact [the parquet-dev group](https://groups.google.com/d/forum/parquet-dev) with your questions and ideas. We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some common issues that are not code structure related, but still important: @@ -243,7 +243,7 @@ We tend to do fairly close readings of pull requests, and you may get a lot of c * Jonathan Coveney * Brock Noland * Tianshuo Deng -* and many others -- see the [Contributor report]( https://github.com/Parquet/parquet-mr/contributors) +* and many others -- see the [Contributor report]( https://github.com/apache/parquet-mr/contributors) ## Code of Conduct diff --git a/parquet-benchmarks/pom.xml b/parquet-benchmarks/pom.xml index 1bc0fe3a25..bd7afd5159 100644 --- a/parquet-benchmarks/pom.xml +++ b/parquet-benchmarks/pom.xml @@ -29,7 +29,7 @@ parquet-benchmarks jar Apache Parquet Benchmarks - https://github.com/Parquet/parquet-mr + https://github.com/apache/parquet-mr 1.3.4 diff --git a/parquet-tools/README.md b/parquet-tools/README.md index 0c3d56e612..d60e1b4dd5 100644 --- a/parquet-tools/README.md +++ b/parquet-tools/README.md @@ -21,7 +21,7 @@ Parquet Tools ====== Parquet-Tools contains java based command line tools that aid -in the inspection of [Parquet files](https://github.com/Parquet). +in the inspection of [Parquet files](https://parquet.apache.org). Currently these tools are available for UN*X systems. diff --git a/parquet_cascading.md b/parquet_cascading.md index 9ea483738d..b8a8718957 100644 --- a/parquet_cascading.md +++ b/parquet_cascading.md @@ -25,12 +25,12 @@ This document details the support of reading and writing parquet format from cas 1. Read and Write ============== -In [parquet-cascading](http://https://github.com/Parquet/parquet-mr/tree/master/parquet-cascading) sub-module, we provide support for reading/writing records of various data structures including Thrift(TBase), Scrooge and Tuples. Please refer to following sections for each data structures. +In [parquet-cascading](https://github.com/apache/parquet-mr/tree/master/parquet-cascading) sub-module, we provide support for reading/writing records of various data structures including Thrift(TBase), Scrooge and Tuples. Please refer to following sections for each data structures. 1.1 Thrift/TBase ------------ ### Read Thrift Records from Parquet -[ParquetTbaseScheme](https://github.com/Parquet/parquet-mr/blob/master/parquet-cascading/src/main/java/parquet/cascading/ParquetTBaseScheme.java) is the interface for reading thrift records in Parquet format. Providing a ParquetTbaseScheme as a parameter to the constructor of a source enables the program to read Thrift object(TBase), eg. +[ParquetTbaseScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetTBaseScheme.java) is the interface for reading thrift records in Parquet format. Providing a ParquetTbaseScheme as a parameter to the constructor of a source enables the program to read Thrift object(TBase), eg. ` Scheme sourceScheme = new ParquetTBaseScheme(Name.class) @@ -42,19 +42,19 @@ In the above example Name is a thrift class that extends TBase. Under the hood p The thrift class is actually *optional* to initialize a ParquetTBaseScheme when the data is written as Thrift records in Parquet. When writing thrift records to parquet format, the Thrift class of the records is stored as meta-data in the footer of the parquet file. Therefore when reading the file, if a thrift class is not explicitly provided, Parquet will use the class name stored in the footer as the thrift class. ### Write Thrift Records to Parquet -[ParquetTbaseScheme](https://github.com/Parquet/parquet-mr/blob/master/parquet-cascading/src/main/java/parquet/cascading/ParquetTBaseScheme.java) can also be used by a sink. When used as a sink, the Thrift class of the records being written must be *explicitly* provided. +[ParquetTbaseScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetTBaseScheme.java) can also be used by a sink. When used as a sink, the Thrift class of the records being written must be *explicitly* provided. ` Scheme sinkScheme = new ParquetTBaseScheme(Name.class); Tap sink = new Hfs(sinkScheme, parquetOutputPath); ` -For more concrete examples please refer to [TestParquetTBaseScheme](https://github.com/Parquet/parquet-mr/blob/master/parquet-cascading/src/test/java/parquet/cascading/TestParquetTBaseScheme.java) +For more concrete examples please refer to [TestParquetTBaseScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/test/java/org/apache/parquet/cascading/TestParquetTBaseScheme.java) 1.2 Scrooge ----------- ### Read Scrooge records from Parquet -Scrooge support is defined in a separate module called [parquet-scrooge](https://github.com/Parquet/parquet-mr/tree/master/parquet-scrooge). With [ParquetScroogeScheme](https://github.com/Parquet/parquet-mr/blob/master/parquet-scrooge/src/main/java/parquet/scrooge/ParquetScroogeScheme.java), data can be read in the form of Scrooge objects which are more scala friendly. +Scrooge support is defined in a separate module called [parquet-scrooge](https://github.com/apache/parquet-mr/tree/master/parquet-scrooge). With [ParquetScroogeScheme](https://github.com/apache/parquet-mr/blob/master/parquet-scrooge/src/main/java/org/apache/parquet/scrooge/ParquetScroogeScheme.java), data can be read in the form of Scrooge objects which are more scala friendly. ` Scheme sinkScheme = new ParquetScroogeScheme(Name.class); @@ -66,7 +66,7 @@ Tap sink = new Hfs(sinkScheme, parquetOutputPath); 1.3 Tuples ---------- ### Read Cascading Tuples from Parquet -Currently, the support for reading tuples is mainly(but not limited) for data written from pig scripts as pig tuples. More comprehensive support will be added, but in the mean time, there are some limitations to notice: Nested structures are not supported. If the data is written as thrift objects which have nested structure, it can not be read at current time. *Data to read must be in flat structure*. To read data as tuples, simply use [ParquetTupleScheme](https://github.com/Parquet/parquet-mr/blob/master/parquet-cascading/src/main/java/parquet/cascading/ParquetTupleScheme.java): +Currently, the support for reading tuples is mainly(but not limited) for data written from pig scripts as pig tuples. More comprehensive support will be added, but in the mean time, there are some limitations to notice: Nested structures are not supported. If the data is written as thrift objects which have nested structure, it can not be read at current time. *Data to read must be in flat structure*. To read data as tuples, simply use [ParquetTupleScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetTupleScheme.java): ` Scheme sourceScheme = new ParquetTupleScheme(new Fields("last_name")); @@ -75,7 +75,7 @@ Tap source = new Hfs(sourceScheme, parquetInputPath); ### Write Cascading Tuples to Parquet(coming soon) -For more examples please refer to [TestParquetTupleScheme](https://github.com/Parquet/parquet-mr/blob/master/parquet-cascading/src/test/java/parquet/cascading/TestParquetTupleScheme.java) +For more examples please refer to [TestParquetTupleScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/test/java/org/apache/parquet/cascading/TestParquetTupleScheme.java) 2. Projection Pushdown ======================