SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter #2239
Comments
Michael Heuer:

    val job = HadoopUtil.newJob(sc)
    val conf = ContextUtil.getConfiguration(job)
    conf.setBoolean("parquet.avro.compatible", false)
Thiruvalluvan M. G.:

    @Test
    public void testConvertedSchemaToStringCantRedefineList() throws Exception {
      String parquet = "message spark_schema {\n" +
          "  optional group annotation {\n" +
          "    optional group transcriptEffects (LIST) {\n" +
          "      repeated group list {\n" +
          "        optional group element {\n" +
          "          optional group effects (LIST) {\n" +
          "            repeated group list2 {\n" +
          "              optional binary element (UTF8);\n" +
          "            }\n" +
          "          }\n" +
          "        }\n" +
          "      }\n" +
          "    }\n" +
          "  }\n" +
          "}\n";
      Configuration conf = new Configuration(false);
      AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
      Schema schema = avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
      schema.toString();
    }

I've verified that this indeed fixes this test.
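For illustration only (this is a sketch, not a schema taken from the thread): when a naive conversion of a nested 3-level Parquet list reuses the repeated group name list for both nesting levels, the resulting Avro schema defines two records with the same full name. Avro record names share a single name space, so parsing such a schema fails with "Can't redefine: list":

    {
      "type": "record", "name": "list",
      "fields": [
        { "name": "element", "type": {
            "type": "record", "name": "list",
            "fields": [ { "name": "element", "type": "string" } ]
        } }
      ]
    }

Here the inner "list" record is the second definition of that name, which is exactly the redefinition the exception in this issue's title reports.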
Michael Heuer: As far as workarounds go, I'm afraid we're so far downstream that I'm not sure we would be able to use one. We use Avro AVDL to generate Java objects for persisting Spark RDDs to Parquet, and separately to generate Scala products for persisting Spark Datasets to Parquet. Spark generates the schema for these Datasets-as-Parquet. Up until Spark version 2.4.0, which bumped Parquet to version 1.10 and Avro to 1.8.2, we could write out Datasets-as-Parquet and read in RDDs-as-Parquet without trouble (the two different schemas were considered compatible).
Nándor Kollár / @nandorKollar: The unit test attached to this PR doesn't reflect the problem, because I think it tests the correct behaviour: in the converter one can switch between 2-level and 3-level lists via configuration. I think the problem is that AvroRecordConverter tries to decide between 3-level and 2-level lists by first trying to interpret the schema as 2-level and checking compatibility with the expected Avro schema. Normally the two are incompatible (if the data was written as 3-level), so Parquet knows it is dealing with a 3-level list. This works fine when lists are not nested inside other lists, but if we try to represent a 3-level nested Parquet list structure as 2-level, the resulting 2-level Avro schema is not even a valid Avro schema!
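To make the 2-level vs. 3-level distinction concrete, here is a sketch (based on the Parquet format's LIST backward-compatibility rules, not code from this thread) of the same optional list of int32 values in both layouts:

    3-level (modern) layout:

    optional group my_list (LIST) {
      repeated group list {
        optional int32 element;
      }
    }

    2-level (legacy) layout:

    optional group my_list (LIST) {
      repeated int32 element;
    }

In the 3-level layout each nesting level introduces a repeated group conventionally named list; reinterpreting a nested 3-level structure as 2-level therefore leaves those same-named groups in place, which is what produces the invalid Avro schema described above.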
Michael Heuer: The regression is complicated and perhaps not worth discussing here: with Spark moving to Parquet 1.10 and Avro 1.8.2, our previous workaround of pinning parquet-avro to 1.8.1 no longer works. That workaround was necessary because Spark depended on Parquet 1.8.2 and Avro 1.7.x, which were incompatible with each other.
Nándor Kollár / @nandorKollar: The schema

    message Message {
      optional group a1 {
        required float a2;
        optional group a1 {
          required float a4;
        }
      }
    }

is not readable via AvroParquetReader. Of course this could easily be solved by renaming the inner a1 to something else, but for lists this doesn't work. I think using Avro namespaces during schema conversion could fix this bug.
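A sketch of the namespace idea (hypothetical converter output, not the actual fix): Avro identifies a record by its full name, namespace plus simple name, so if the converter qualified each nested record with a namespace derived from its path, both records could keep the simple name a1 without colliding:

    { "type": "record", "name": "a1", "namespace": "Message",
      "fields": [
        { "name": "a2", "type": "float" },
        { "name": "a1", "type": {
            "type": "record", "name": "a1", "namespace": "Message.a1",
            "fields": [ { "name": "a4", "type": "float" } ] } }
      ]
    }

The full names Message.a1 and Message.a1.a1 are distinct, so this parses without a redefinition error; the same trick would disambiguate the repeated list groups of nested 3-level lists.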
The following unit test added to TestAvroSchemaConverter fails, while this one succeeds. I don't see a way to influence the code path in AvroIndexedRecordConverter to respect this configuration, resulting in the following stack trace downstream.

See also downstream issues:
https://issues.apache.org/jira/browse/SPARK-25588
bigdatagenomics/adam#2058
Reporter: Michael Heuer
Assignee: Nándor Kollár / @nandorKollar
Note: This issue was originally created as PARQUET-1441. Please see the migration documentation for further details.