Parquet file in invalid state while writing to S3 from EMR #1971

Description

I'm writing Parquet to S3 from Spark 1.6.1 on EMR. When the job got to the last few files to write to S3, I received the stack trace below, with no other errors in the log before or after it. It's very consistent: this particular batch keeps failing the same way.
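
The failing write goes through Spark's dynamic-partition path (see DynamicPartitionWriterContainer in the trace below). A minimal sketch of that kind of write; the bucket, input, and partition column are placeholders, not the real job:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

// Sketch of a Spark 1.6 partitioned Parquet write to S3 (placeholder paths/columns).
object WriteParquetToS3 {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-s3-write"))
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read.json("s3://example-bucket/input/") // any DataFrame works

    df.write
      .mode(SaveMode.Append)
      .partitionBy("dt") // dynamic partitions, hence DynamicPartitionWriterContainer in the trace
      .parquet("s3://example-bucket/output/") // the failure surfaces when each part file is closed
  }
}
```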

[2016-06-10 01:46:05,282] WARN org.apache.spark.scheduler.TaskSetManager [task-result-getter-2] - Lost task 3737.0 in stage 2.0 (TID 10585, ip-172-16-96-32.ec2.internal): org.apache.spark.SparkException: Task failed while writing rows.
	at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: COLUMN
	at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:146)
	at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:138)
	at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:195)
	at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
	at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
	at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
	at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:405)
	... 8 more
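
For context on the "invalid state" message: it comes from a small state machine inside ParquetFileWriter that guards the legal call order (start, then startBlock/startColumn/endColumn/endBlock per row group, then end). A simplified, illustrative sketch, assuming a stripped-down version of the real STATE enum, of how calling close() while still in state COLUMN produces exactly this exception:

```java
import java.io.IOException;

// Simplified, illustrative sketch of the call-order guard in
// org.apache.parquet.hadoop.ParquetFileWriter; not the library's verbatim code.
public class WriterStateDemo {
    enum State {
        NOT_STARTED, STARTED, BLOCK, COLUMN, ENDED;

        State startBlock() throws IOException {
            // startBlock() is only legal from STARTED. Still being in COLUMN
            // means a column write died partway through and the writer was
            // never unwound back to STARTED.
            if (this != STARTED) {
                throw new IOException(
                    "The file being written is in an invalid state. "
                        + "Probably caused by an error thrown previously. "
                        + "Current state: " + name());
            }
            return BLOCK;
        }
    }

    public static void main(String[] args) throws IOException {
        State s = State.COLUMN; // where the writer was left by the earlier failure
        s.startBlock();         // close() -> flushRowGroupToStore() -> startBlock() throws here
    }
}
```

If that reading is right, the IOException at close() is a symptom rather than the root cause: something interrupted a column write earlier (on S3 this is often the underlying output stream failing), and that original exception may have been swallowed, which would also explain why nothing else shows up near this error in the log.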

Reporter: Peter Halliday

Note: This issue was originally created as PARQUET-632. Please see the migration documentation for further details.
