Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3731: MVP to read parquet in R library #3230

Closed
wants to merge 11 commits into from
10 changes: 10 additions & 0 deletions r/R/read_parquet.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
#' Read parquet file from disk
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs the license headers, you can just copy from any other R file.
https://travis-ci.org/apache/arrow/jobs/470349974#L493

#'
#' @param files a vector of filenames
#' @export
read_parquet = function(files) {
tables = lapply(files, function(f) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'd probably use purrr::map() or even purrr::map_dfr()

return (as_tibble(shared_ptr(`arrow::Table`, read_parquet_file(f))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could have only one parquet reader instead of one per file. But I need to read more about the parquet api first.

})
do.call('rbind', tables)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd probably use vctrs::vec_rbind(!!!tables) or dplyr::bind_rows(!!!tables) instead of do.call + rbind.

arrow currently does not depend on dplyr so I guess vctrs::vec_rbind(!!!tables) is preferred. @hadley

}
3 changes: 2 additions & 1 deletion r/configure
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,8 @@ if [ "$INCLUDE_DIR" ] || [ "$LIB_DIR" ]; then
elif [ "$PKGCONFIG_CFLAGS" ] || [ "$PKGCONFIG_LIBS" ]; then
echo "Found pkg-config cflags and libs!"
PKG_CFLAGS=${PKGCONFIG_CFLAGS}
PKG_LIBS=${PKGCONFIG_LIBS}
#PKG_LIBS=${PKGCONFIG_LIBS}
PKG_LIBS="-larrow -lparquet" # hard coded! what is the right way to do this?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should rather be addressed by updating the PKG_CONFIG_NAME higher

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this seems to work for me:

PKG_CONFIG_NAME="arrow parquet"

This needs an update to the README.Rmd (and README.md) to have -DARROW_PARQUET=ON, i.e.:

cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off

elif [[ "$OSTYPE" == "darwin"* ]]; then
if [ "$(command -v brew)" ]; then
BREWDIR=$(brew --prefix)
Expand Down
38 changes: 38 additions & 0 deletions r/src/parquetfilereader.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
// // Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cpp files must adhere to the style. You can enforce that with

./r/lint.sh --fix

// // or more contributor license agreements. See the NOTICE file
// // distributed with this work for additional information
// // regarding copyright ownership. The ASF licenses this file
// // to you under the Apache License, Version 2.0 (the
// // "License"); you may not use this file except in compliance
// // with the License. You may obtain a copy of the License at
// //
// // http://www.apache.org/licenses/LICENSE-2.0
// //
// // Unless required by applicable law or agreed to in writing,
// // software distributed under the License is distributed on an
// // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// // KIND, either express or implied. See the License for the
// // specific language governing permissions and limitations
// // under the License.
//
//
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>
#include <parquet/exception.h>

// [[Rcpp::export]]
std::shared_ptr<arrow::Table> read_parquet_file(std::string filename) {
std::shared_ptr<arrow::io::ReadableFile> infile;
PARQUET_THROW_NOT_OK(arrow::io::ReadableFile::Open(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the name is a big hint, I'm not sure what the macro does and if it should perhaps be replaced by something that plays well with Rcpp.

filename, arrow::default_memory_pool(), &infile));

std::unique_ptr<parquet::arrow::FileReader> reader;
PARQUET_THROW_NOT_OK(
parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
std::shared_ptr<arrow::Table> table;
PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

return table;
}