README.md (20 changes: 10 additions & 10 deletions)
@@ -20,7 +20,7 @@
Parquet MR [![Build Status](https://travis-ci.org/apache/incubator-parquet-mr.svg?branch=master)](http://travis-ci.org/apache/incubator-parquet-mr)
======

- Parquet-MR contains the java implementation of the [Parquet format](https://github.com/Parquet/parquet-format).
+ Parquet-MR contains the Java implementation of the [Parquet format](https://github.com/apache/parquet-format).
Parquet is a columnar storage format for Hadoop; it provides efficient storage and encoding of data.
Parquet uses the [record shredding and assembly algorithm](https://github.com/Parquet/parquet-mr/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper) described in the Dremel paper to represent nested structures.

@@ -95,27 +95,27 @@ Parquet is a very active project, and new features are being added quickly; below

## Map/Reduce integration

- [Input](https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java) and [Output](https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetOutputFormat.java) formats.
+ [Input](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java) and [Output](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java) formats.
Note that to use an Input or Output format, you need to implement a WriteSupport or ReadSupport class, which handles the conversion of your objects to and from a Parquet schema.
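To make that contract concrete, here is a minimal `WriteSupport` sketch. It is an illustration only, not code from this PR: it uses the `org.apache.parquet` package names this change introduces, and the `Point` class with its two-column schema is hypothetical.

```java
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.api.RecordConsumer;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// Hypothetical record type, used only for this sketch.
class Point {
  final double x, y;
  Point(double x, double y) { this.x = x; this.y = y; }
}

// Maps each Point onto the schema below via the event-based RecordConsumer.
public class PointWriteSupport extends WriteSupport<Point> {
  private static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
      "message Point { required double x; required double y; }");

  private RecordConsumer consumer;

  @Override
  public WriteContext init(Configuration configuration) {
    // Declare the schema and (empty) extra key/value metadata for the footer.
    return new WriteContext(SCHEMA, new HashMap<String, String>());
  }

  @Override
  public void prepareForWrite(RecordConsumer recordConsumer) {
    this.consumer = recordConsumer;
  }

  @Override
  public void write(Point record) {
    // Emit one record as a start/field/end event sequence.
    consumer.startMessage();
    consumer.startField("x", 0);
    consumer.addDouble(record.x);
    consumer.endField("x", 0);
    consumer.startField("y", 1);
    consumer.addDouble(record.y);
    consumer.endField("y", 1);
    consumer.endMessage();
  }
}
```

A job would then register it with `ParquetOutputFormat.setWriteSupportClass(job, PointWriteSupport.class)`.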

We've implemented this for two popular data formats to provide a clean migration path as well:

### Thrift
- Thrift integration is provided by the [parquet-thrift](https://github.com/Parquet/parquet-mr/tree/master/parquet-thrift) sub-project. If you are using Thrift through Scala, you may be using Twitter's [Scrooge](https://github.com/twitter/scrooge). If that's the case, not to worry -- we took care of the Scrooge/Apache Thrift glue for you in the [parquet-scrooge](https://github.com/Parquet/parquet-mr/tree/master/parquet-scrooge) sub-project.
+ Thrift integration is provided by the [parquet-thrift](https://github.com/apache/parquet-mr/tree/master/parquet-thrift) sub-project. If you are using Thrift through Scala, you may be using Twitter's [Scrooge](https://github.com/twitter/scrooge). If that's the case, not to worry -- we took care of the Scrooge/Apache Thrift glue for you in the [parquet-scrooge](https://github.com/apache/parquet-mr/tree/master/parquet-scrooge) sub-project.

### Avro
- Avro conversion is implemented via the [parquet-avro](https://github.com/Parquet/parquet-mr/tree/master/parquet-avro) sub-project.
+ Avro conversion is implemented via the [parquet-avro](https://github.com/apache/parquet-mr/tree/master/parquet-avro) sub-project.

### Create your own objects
* The ParquetOutputFormat can be provided a WriteSupport to write your own objects to an event-based RecordConsumer.
* The ParquetInputFormat can be provided a ReadSupport to materialize your own objects by implementing a RecordMaterializer (see the sketch below).
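Continuing the hypothetical `Point` example from above, a read-side sketch needs both pieces: a `ReadSupport` that chooses the schema to read, and a `RecordMaterializer` whose converter tree receives primitive values as each record is assembled.

```java
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.InitContext;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.io.api.Converter;
import org.apache.parquet.io.api.GroupConverter;
import org.apache.parquet.io.api.PrimitiveConverter;
import org.apache.parquet.io.api.RecordMaterializer;
import org.apache.parquet.schema.MessageType;

// Read-side counterpart to the PointWriteSupport sketch above.
public class PointReadSupport extends ReadSupport<Point> {

  @Override
  public ReadContext init(InitContext context) {
    // Request the full file schema; a real implementation could
    // request a projection (a subset of the columns) here instead.
    return new ReadContext(context.getFileSchema());
  }

  @Override
  public RecordMaterializer<Point> prepareForRead(Configuration conf,
      Map<String, String> keyValueMetaData, MessageType fileSchema,
      ReadContext readContext) {
    return new PointMaterializer();
  }

  private static final class PointMaterializer extends RecordMaterializer<Point> {
    private double x, y;

    // One converter per leaf column; Parquet calls addDouble while
    // assembling each record.
    private final Converter xConverter = new PrimitiveConverter() {
      @Override public void addDouble(double value) { x = value; }
    };
    private final Converter yConverter = new PrimitiveConverter() {
      @Override public void addDouble(double value) { y = value; }
    };
    private final GroupConverter root = new GroupConverter() {
      @Override public Converter getConverter(int fieldIndex) {
        return fieldIndex == 0 ? xConverter : yConverter;
      }
      @Override public void start() {}
      @Override public void end() {}
    };

    @Override public Point getCurrentRecord() { return new Point(x, y); }
    @Override public GroupConverter getRootConverter() { return root; }
  }
}
```

`ParquetInputFormat.setReadSupportClass(job, PointReadSupport.class)` registers it on the read side.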

See the APIs:
- * [Record conversion API](https://github.com/Parquet/parquet-mr/tree/master/parquet-column/src/main/java/parquet/io/api)
- * [Hadoop API](https://github.com/Parquet/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/api)
+ * [Record conversion API](https://github.com/apache/parquet-mr/tree/master/parquet-column/src/main/java/parquet/io/api)
+ * [Hadoop API](https://github.com/apache/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/api)

## Apache Pig integration
- A [Loader](https://github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/main/java/parquet/pig/ParquetLoader.java) and a [Storer](https://github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/main/java/parquet/pig/ParquetStorer.java) are provided to read and write Parquet files with Apache Pig
+ A [Loader](https://github.com/apache/parquet-mr/blob/master/parquet-pig/src/main/java/org/apache/parquet/pig/ParquetLoader.java) and a [Storer](https://github.com/apache/parquet-mr/blob/master/parquet-pig/src/main/java/org/apache/parquet/pig/ParquetStorer.java) are provided to read and write Parquet files with Apache Pig.

Storing data into Parquet in Pig is simple:
```
(Pig snippet collapsed in the diff)
```
@@ -134,7 +134,7 @@ If the data was stored using Pig, things will "just work". If the data was stored

## Hive integration

- Hive integration is provided via the [parquet-hive](https://github.com/Parquet/parquet-mr/tree/master/parquet-hive) sub-project.
+ Hive integration is provided via the [parquet-hive](https://github.com/apache/parquet-mr/tree/master/parquet-hive) sub-project.

## Build

@@ -222,7 +222,7 @@ The build runs in [Travis CI](http://travis-ci.org/Parquet/parquet-mr):

### How To Contribute

- If you are looking for some ideas on what to contribute, check out GitHub issues for this project labeled ["Pick me up!"](https://github.com/Parquet/parquet-mr/issues?labels=pick+me+up%21&state=open).
+ If you are looking for some ideas on what to contribute, check out GitHub issues for this project labeled ["Pick me up!"](https://github.com/apache/parquet-mr/issues?labels=pick+me+up%21&state=open).
Comment on the issue and/or contact [the parquet-dev group](https://groups.google.com/d/forum/parquet-dev) with your questions and ideas.

We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some common issues that are not code structure related, but still important:
@@ -243,7 +243,7 @@ We tend to do fairly close readings of pull requests, and you may get a lot of comments.
* Jonathan Coveney <http://twitter.com/jco>
* Brock Noland <https://github.com/brockn>
* Tianshuo Deng <https://github.com/tsdeng>
- * and many others -- see the [Contributor report]( https://github.com/Parquet/parquet-mr/contributors)
+ * and many others -- see the [Contributor report](https://github.com/apache/parquet-mr/contributors)

## Code of Conduct

parquet-benchmarks/pom.xml (2 changes: 1 addition & 1 deletion)
@@ -29,7 +29,7 @@
<artifactId>parquet-benchmarks</artifactId>
<packaging>jar</packaging>
<name>Apache Parquet Benchmarks</name>
- <url>https://github.com/Parquet/parquet-mr</url>
+ <url>https://github.com/apache/parquet-mr</url>

<properties>
<jmh.version>1.3.4</jmh.version>
parquet-tools/README.md (2 changes: 1 addition & 1 deletion)
@@ -21,7 +21,7 @@ Parquet Tools
======

Parquet-Tools contains Java-based command line tools that aid
- in the inspection of [Parquet files](https://github.com/Parquet).
+ in the inspection of [Parquet files](https://parquet.apache.org).

Currently these tools are available for UN*X systems.

parquet_cascading.md (14 changes: 7 additions & 7 deletions)
@@ -25,12 +25,12 @@ This document details the support of reading and writing parquet format from cascading
1. Read and Write
==============

- In [parquet-cascading](http://https://github.com/Parquet/parquet-mr/tree/master/parquet-cascading) sub-module, we provide support for reading/writing records of various data structures including Thrift(TBase), Scrooge and Tuples. Please refer to following sections for each data structures.
+ In the [parquet-cascading](https://github.com/apache/parquet-mr/tree/master/parquet-cascading) sub-module, we provide support for reading/writing records of various data structures, including Thrift (TBase), Scrooge, and Tuples. Please refer to the following sections for each data structure.

1.1 Thrift/TBase
------------
### Read Thrift Records from Parquet
- [ParquetTBaseScheme](https://github.com/Parquet/parquet-mr/blob/master/parquet-cascading/src/main/java/parquet/cascading/ParquetTBaseScheme.java) is the interface for reading thrift records in Parquet format. Providing a ParquetTBaseScheme as a parameter to the constructor of a source enables the program to read Thrift objects (TBase), e.g.
+ [ParquetTBaseScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetTBaseScheme.java) is the interface for reading Thrift records in Parquet format. Providing a ParquetTBaseScheme as a parameter to the constructor of a source enables the program to read Thrift objects (TBase), e.g.

`
Scheme sourceScheme = new ParquetTBaseScheme(Name.class)
@@ -42,19 +42,19 @@ In the above example Name is a thrift class that extends TBase. Under the hood p
The thrift class is actually *optional* to initialize a ParquetTBaseScheme when the data is written as Thrift records in Parquet. When writing thrift records to parquet format, the Thrift class of the records is stored as meta-data in the footer of the parquet file. Therefore when reading the file, if a thrift class is not explicitly provided, Parquet will use the class name stored in the footer as the thrift class.
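So a read-side scheme can be constructed without naming the class at all; a small sketch, assuming the no-argument `ParquetTBaseScheme` constructor this paragraph implies:

```java
// The Thrift class is recovered from the Parquet file footer, so no
// class argument is needed here (assumed no-arg constructor).
Scheme sourceScheme = new ParquetTBaseScheme();
Tap source = new Hfs(sourceScheme, parquetInputPath);
```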

### Write Thrift Records to Parquet
- [ParquetTBaseScheme](https://github.com/Parquet/parquet-mr/blob/master/parquet-cascading/src/main/java/parquet/cascading/ParquetTBaseScheme.java) can also be used by a sink. When used as a sink, the Thrift class of the records being written must be *explicitly* provided.
+ [ParquetTBaseScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetTBaseScheme.java) can also be used by a sink. When used as a sink, the Thrift class of the records being written must be *explicitly* provided.

`
Scheme sinkScheme = new ParquetTBaseScheme(Name.class);
Tap sink = new Hfs(sinkScheme, parquetOutputPath);
`

- For more concrete examples please refer to [TestParquetTBaseScheme](https://github.com/Parquet/parquet-mr/blob/master/parquet-cascading/src/test/java/parquet/cascading/TestParquetTBaseScheme.java)
+ For more concrete examples please refer to [TestParquetTBaseScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/test/java/org/apache/parquet/cascading/TestParquetTBaseScheme.java)
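To show how the pieces fit together, here is a minimal end-to-end sketch that copies records from one Parquet file to another. It is an illustration only: the HDFS paths are hypothetical, and `Name` stands in for any Thrift-generated class.

```java
import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

import org.apache.parquet.cascading.ParquetTBaseScheme;

public class CopyNames {
  public static void main(String[] args) {
    // Same Thrift class on both sides; Name is Thrift-generated.
    Scheme sourceScheme = new ParquetTBaseScheme(Name.class);
    Scheme sinkScheme = new ParquetTBaseScheme(Name.class);

    // Hypothetical input and output locations.
    Tap source = new Hfs(sourceScheme, "hdfs:///data/names-in");
    Tap sink = new Hfs(sinkScheme, "hdfs:///data/names-out");

    // A pass-through pipe: read every record and write it back out.
    Pipe pipe = new Pipe("copy");
    Flow flow = new HadoopFlowConnector().connect(source, sink, pipe);
    flow.complete();
  }
}
```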

1.2 Scrooge
-----------
### Read Scrooge records from Parquet
- Scrooge support is defined in a separate module called [parquet-scrooge](https://github.com/Parquet/parquet-mr/tree/master/parquet-scrooge). With [ParquetScroogeScheme](https://github.com/Parquet/parquet-mr/blob/master/parquet-scrooge/src/main/java/parquet/scrooge/ParquetScroogeScheme.java), data can be read in the form of Scrooge objects which are more scala friendly.
+ Scrooge support is defined in a separate module called [parquet-scrooge](https://github.com/apache/parquet-mr/tree/master/parquet-scrooge). With [ParquetScroogeScheme](https://github.com/apache/parquet-mr/blob/master/parquet-scrooge/src/main/java/org/apache/parquet/scrooge/ParquetScroogeScheme.java), data can be read in the form of Scrooge objects, which are more Scala-friendly.

`
Scheme sinkScheme = new ParquetScroogeScheme(Name.class);
@@ -66,7 +66,7 @@ Tap sink = new Hfs(sinkScheme, parquetOutputPath);
1.3 Tuples
----------
### Read Cascading Tuples from Parquet
- Currently, the support for reading tuples is mainly(but not limited) for data written from pig scripts as pig tuples. More comprehensive support will be added, but in the mean time, there are some limitations to notice: Nested structures are not supported. If the data is written as thrift objects which have nested structure, it can not be read at current time. *Data to read must be in flat structure*. To read data as tuples, simply use [ParquetTupleScheme](https://github.com/Parquet/parquet-mr/blob/master/parquet-cascading/src/main/java/parquet/cascading/ParquetTupleScheme.java):
+ Currently, support for reading tuples is aimed mainly (but not only) at data written from Pig scripts as Pig tuples. More comprehensive support will be added, but in the meantime there is a limitation to note: nested structures are not supported, so data written as Thrift objects with nested structures cannot currently be read. *Data to read must have a flat structure*. To read data as tuples, simply use [ParquetTupleScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/main/java/org/apache/parquet/cascading/ParquetTupleScheme.java):

`
Scheme sourceScheme = new ParquetTupleScheme(new Fields("last_name"));
@@ -75,7 +75,7 @@ Tap source = new Hfs(sourceScheme, parquetInputPath);

### Write Cascading Tuples to Parquet (coming soon)

- For more examples please refer to [TestParquetTupleScheme](https://github.com/Parquet/parquet-mr/blob/master/parquet-cascading/src/test/java/parquet/cascading/TestParquetTupleScheme.java)
+ For more examples please refer to [TestParquetTupleScheme](https://github.com/apache/parquet-mr/blob/master/parquet-cascading/src/test/java/org/apache/parquet/cascading/TestParquetTupleScheme.java)

2. Projection Pushdown
======================