An overwrite option for all sink types that write to HDFS #224
Hello, is it possible to append the content of the file? I tested ... In comparison, ...
@mycaule Thank you - I have reopened this issue and we will look into this.
@mycaule Unfortunately I cannot reproduce the problem - do you have a test case that reproduces it?
it should "support overwrite" in {
val path = new Path(s"target/${UUID.randomUUID().toString}", s"${UUID.randomUUID().toString}.pq")
val schema = StructType(Field("a", StringType))
val ds = DataStream.fromRows(
schema,
Seq(
Row(schema, Vector("x")),
Row(schema, Vector("y"))
)
)
// Write twice to test overwrite
ds.to(ParquetSink(path))
ds.to(ParquetSink(path).withOverwrite(true))
var parentStatus = fs.listStatus(path.getParent)
println("Parquet Overwrite:")
parentStatus.foreach(p => println(p.getPath))
parentStatus.length shouldBe 1
parentStatus.head.getPath.getName shouldBe path.getName
// Write again without overwrite
val appendPath = new Path(path.getParent, s"${UUID.randomUUID().toString}.pq")
ds.to(ParquetSink(appendPath).withOverwrite(false))
parentStatus = fs.listStatus(path.getParent)
println("Parquet Append:")
parentStatus.foreach(p => println(p.getPath))
parentStatus.length shouldBe 2
}
Hello, my code looks like this. The file path is constant and I would like the file to be appended each time. This file may need to have the Hadoop method ...
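(Not the original snippet — just a sketch of that pattern using the withAppend option shown later in this thread; the constant path, the schema, and the import locations are assumptions and may differ across eel versions.)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import io.eels.Row
import io.eels.component.csv.CsvSink
import io.eels.datastream.DataStream
import io.eels.schema.{Field, StringType, StructType}

// Assumed package layout; adjust to the eel version in use.
implicit val conf = new Configuration()
implicit val fs = FileSystem.get(conf)

val schema = StructType(Field("a", StringType))
val ds = DataStream.fromRows(schema, Seq(Row(schema, Vector("x")), Row(schema, Vector("y"))))

// Constant output path (illustrative): each run should append rows rather than replace the file.
val path = new Path("hdfs:///data/reports/output.csv")
ds.to(CsvSink(path).withAppend(true))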
I think the problem occurs when testing with HDFS with a particular cluster configuration. I checked my HDFS cluster configuration and it does support appending.
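For what it's worth, here is a minimal sketch (the probe path is illustrative) for checking whether the FileSystem a test actually resolves to supports append; implementations that do not will throw the IOException quoted later in this thread:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)

// Hypothetical scratch file used only for the probe.
val probe = new Path("/tmp/append-probe.txt")
if (!fs.exists(probe)) fs.create(probe).close()

try {
  // FileSystem.append throws IOException on implementations that do not support it.
  fs.append(probe).close()
  println(s"${fs.getClass.getSimpleName} supports append")
} catch {
  case e: java.io.IOException =>
    println(s"${fs.getClass.getSimpleName} does not support append: ${e.getMessage}")
}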
@mycaule
@mycaule In addition, the Hadoop ...

// Now write to the same file in append mode and test that we have double the amount of rows
ds.to(CsvSink(path).withOverwrite(true))
ds.to(CsvSink(path).withAppend(true))
using(fs.open(path)) { inputStream =>
  using(new BufferedReader(new InputStreamReader(inputStream))) { reader =>
    val lines = reader.lines().toArray
    println(lines.mkString("\n"))
    lines.length shouldBe 4
  }
}

Which unfortunately yields the following exception:
The Hadoop code says:

@Override
public FSDataOutputStream append(Path f, int bufferSize,
    Progressable progress) throws IOException {
  throw new IOException("Not supported");
}
@mycaule I would like to close this issue if you have no objections?
Ok, thanks for investigating anyway!
Please close this.
Proposal
Sink.withOverwrite
Affected Sinks
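For illustration, a minimal sketch of how such an option could look on a sink, assuming the copy-based builder style the existing sinks use (the ExampleSink name and the prepare method are illustrative, not the project's actual API):

import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative sink carrying an overwrite flag; the real sinks would expose
// withOverwrite in the same builder style as the withOverwrite/withAppend calls above.
case class ExampleSink(path: Path, overwrite: Boolean = false) {

  def withOverwrite(overwrite: Boolean): ExampleSink = copy(overwrite = overwrite)

  // Before opening a writer, delete any existing file when overwrite is enabled.
  def prepare(fs: FileSystem): Unit =
    if (overwrite && fs.exists(path)) fs.delete(path, false)
}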