
Parquet without Hadoop dependencies #2473

@asfimport

Description


I have been trying for weeks to create a Parquet file from Avro data and write it to S3 in Java. This has been incredibly frustrating, and odd, since Spark can apparently do it easily.

I have assembled the correct jars through luck and diligence, but now I find that I have to have Hadoop installed on my machine. I am currently developing on Windows, and it seems a DLL and an EXE can fix that up, but I am wondering about Linux, as the code will eventually run in Fargate on AWS.

Why do I need external dependencies rather than pure Java?

The real problem is how utterly complex all of this is. I would like to create an Avro file, convert it to Parquet, and write it to S3, but I am trapped in "ParquetWriter" hell!

Why can't I get a normal OutputStream and write it wherever I want?

I have scoured the web for examples, and there are a few, but we really need some documentation on this stuff. I understand that there may be reasons for all of this, but I can't find them anywhere on the web. Any help? Can't we get a "SimpleParquet" jar that does this:


ParquetWriter<GenericData.Record> writer = AvroParquetWriter.<GenericData.Record>builder(outputStream)
        .withSchema(avroSchema)
        .withConf(conf)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .withWriteMode(Mode.OVERWRITE) // probably not good for prod (overwrites files)
        .build();


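For anyone hitting the same wall, a partial workaround exists today: parquet-mr exposes an org.apache.parquet.io.OutputFile abstraction, and recent versions of AvroParquetWriter.builder accept it, so a plain OutputStream can be adapted without ever touching a Hadoop Path. The sketch below is illustrative only (the StreamOutputFile class, the writeToBuffer helper, and the in-memory buffer are assumptions, not part of any library), and hadoop-common still has to be on the classpath because parquet-mr uses Hadoop's Configuration internally.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

// Adapts a plain java.io.OutputStream to the OutputFile interface that
// ParquetWriter builders accept, so no Hadoop Path or FileSystem is needed.
public class StreamOutputFile implements OutputFile {

    private final OutputStream out;

    public StreamOutputFile(OutputStream out) {
        this.out = out;
    }

    @Override
    public PositionOutputStream create(long blockSizeHint) {
        return new PositionOutputStream() {
            private long position = 0;

            @Override
            public long getPos() {
                return position;
            }

            @Override
            public void write(int b) throws IOException {
                out.write(b);
                position++;
            }

            @Override
            public void write(byte[] buf, int off, int len) throws IOException {
                out.write(buf, off, len);
                position += len;
            }

            @Override
            public void close() throws IOException {
                out.close();
            }
        };
    }

    @Override
    public PositionOutputStream createOrOverwrite(long blockSizeHint) {
        return create(blockSizeHint);
    }

    @Override
    public boolean supportsBlockSize() {
        return false;
    }

    @Override
    public long defaultBlockSize() {
        return 0;
    }

    // Example usage (hypothetical helper): write Avro records into an
    // in-memory buffer, then upload the resulting bytes to S3 with the
    // AWS SDK (the S3 upload itself is omitted here).
    public static byte[] writeToBuffer(Schema avroSchema, Iterable<GenericData.Record> records) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ParquetWriter<GenericData.Record> writer =
                 AvroParquetWriter.<GenericData.Record>builder(new StreamOutputFile(buffer))
                     .withSchema(avroSchema)
                     .withCompressionCodec(CompressionCodecName.SNAPPY)
                     .build()) {
            for (GenericData.Record record : records) {
                writer.write(record);
            }
        }
        return buffer.toByteArray();
    }
}

For large files you would stream to S3 with a multipart upload rather than buffering everything in memory, but the adapter itself stays the same.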
Environment: Amazon Fargate (Linux), Windows development box.

We are writing Parquet to be read by the Snowflake and Athena databases.
Reporter: mark juchems
Assignee: Atour Mousavi Gourabi / @amousavigourabi

Related issues:

PRs and other links:

Note: This issue was originally created as PARQUET-1822. Please see the migration documentation for further details.
