Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-3704: name validator interface #2053

Merged
merged 4 commits into from
Sep 12, 2023

Conversation

clesaec
Copy link
Contributor

@clesaec clesaec commented Jan 11, 2023

What is the purpose of the change

As explain in AVRO-3704, this is to allow, in Java, to choose name validation type. Currently, there is only a choice between no validation at all or a validation that accept accent, which is not accepted by official doc.
Here, we allow user to inject its own interface of name validation ; and, in same time, proposed 3 kind of implementation :

  • No validation : accept all names, like currently when option "no validation" is chosen.
  • Utf validation : reproduce current java code and accept accent.
  • Strict validation : follow documentation.

Verifying this change

This change added tests and can be verified as follows:
Unit test SchemaNameValidatorTest is added and can be run locally (like it is run in CI)

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? not documented.

@github-actions github-actions bot added the Java Pull Requests for Java binding label Jan 11, 2023

import java.util.stream.Stream;

class SchemaNameValidatorTest {

Check notice

Code scanning / CodeQL

Unused classes and interfaces Note test

Unused class: SchemaNameValidatorTest is not referenced within this codebase. If not used as an external API it should be removed.
@github-actions github-actions bot added the build label Jan 11, 2023
pom.xml Outdated
Comment on lines 157 to 161
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-resources-plugin</artifactId>
<version>${maven-resources-plugin.version}</version>
</plugin>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this addition: why was the maven-resources-plugin added? Was it to be able to specify a recent version number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI failed at "maven4" step, see last comment on AVRO-3701, i put this plugin from 1.7.0 to 3.0.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#2075 - I've extracted just this change, so that all other PRs won't break because of it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clesaec clesaec requested a review from opwvhk January 17, 2023 07:31
Copy link
Contributor

@opwvhk opwvhk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

One question (but it's also a matter of taste): should we nest everything in the Schema class? Or is name validation something you to before creating a schema/field with that name?

In the near future, I hope my PR to extract a separate SchemaParser will be merged (the last in the series #1588, #1589, #1954). This will (partially) extract parsing from the Schema class. Would that still be clean when the NameValidator interface is nested in the Schema class?

@clesaec
Copy link
Contributor Author

clesaec commented Jan 17, 2023

NameValidator has no direct link with Schema nor SchemaParser; so it could be easily embeded in SchemaParser or put in a separate Java file with your PR (i didn't figure out yet what would be the best).
And i agree with you, even if Schema.java is divided into several small independant pieces (classes), i also would prefer to see SchemaParser in a separate file (and mostly put Schema.parse(JsonNodes, Names) method outside of Schema class).
I also made this small PR on schema parsing that allow to load recursive protocol like

{"name": "User", "type": "record", "fields": [{"name": "current_status", "type": "Status"}]}
{"name": "Status", "type": "record", "fields": [{"name": "author", "type": "User"}]}

I would adapt it if yours is merge before :)

@opwvhk
Copy link
Contributor

opwvhk commented Jan 31, 2023

I would adapt it if yours is merge before :)

Ah, if only... but it also requires the ANTLR PRs (#1588 and #1589) to be merged, and there simple haven't been enough reviews yet.

@github-actions github-actions bot removed the build label Jan 31, 2023
@clesaec clesaec changed the title AVRO-3704: namae validator interface AVRO-3704: name validator interface Feb 20, 2023
@knotenpunkt
Copy link

This naming-thing was also a problem, if we use that avro-lib in combination with parquet. It is an nice adapter, but that validating of the names entailed that we can't write parquet files with "all" names, also at reading-parquets file in with that library caused in exceptions

Take a look here: https://stackoverflow.com/a/39734610/2182302

try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
        .<GenericData.Record>builder(fileToWrite)
        .withSchema(schema)
        .withConf(new Configuration())
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
    for (GenericData.Record record : recordsToWrite) {
        writer.write(record);
    }
}   

I think a lot of projects use that AvroParquetWriter

Maybe we should deactivate that naming-validation in that AvroParquetWriter by default?

@knotenpunkt
Copy link

Actually we used that library by patching that avro-schema-java-file with some maven magic:

 private static String validateName(String name) {
//    if (!validateNames.get())
//      return name; // not validating names
//    if (name == null)
//      throw new SchemaParseException("Null name");
//    int length = name.length();
//    if (length == 0)
//      throw new SchemaParseException("Empty name");
//    char first = name.charAt(0);
//    if (!(Character.isLetter(first) || first == '_'))
//      throw new SchemaParseException("Illegal initial character: " + name);
//    for (int i = 1; i < length; i++) {
//      char c = name.charAt(i);
//      if (!(Character.isLetterOrDigit(c) || c == '_'))
//        throw new SchemaParseException("Illegal character in: " + name);
//    }
    return name;
  }

This pull-request can solve that, so we don't have to trick like that any more?!

@knotenpunkt
Copy link

As explain in AVRO-3704, this is to allow, in Java, to choose name validation type. Currently, there is only a choice between no validation at all or a validation that accept accent, which is not accepted by official doc.

How was it possible to choose no validation at all, if i use AvroParquetWriter ?

@clesaec
Copy link
Contributor Author

clesaec commented Feb 27, 2023

As NameValidator is an interface, in fact, you can also put your own customize name validator.
For AvroParquetWriter, it seems to take directly an Avro Schema, and as name validation is only called on Schema construction, it should be OK. I don't have example to give as AvroParquetWriter is owned by parquet github project.

@knotenpunkt
Copy link

As NameValidator is an interface, in fact, you can also put your own customize name validator. For AvroParquetWriter, it seems to take directly an Avro Schema, and as name validation is only called on Schema construction, it should be OK. I don't have example to give as AvroParquetWriter is owned by parquet github project.

Yes you are right, it takes a normal AvroSchema directly.
We build that Schema from that SchemaBuilder (this repository^^)
But how can i set in that Builder my own customize name validator or how can i disable it at all?

@clesaec
Copy link
Contributor Author

clesaec commented Mar 1, 2023

Indeed, I had not taken into account SchemaBuilder (only Schema parser).
So, i know put a public setter on static threadlocal Schema NameValidator; so, you can use Schema.setNameValidator(Schema.NameValidator.NO_VALIDATION);
(May be a better solution would have to add a method "SchemaBuilder.builder(NameValidator v)" and then use it, but it would also imply a lot of change (in class Name, ...) for only one PR)

@knotenpunkt
Copy link

Indeed, I had not taken into account SchemaBuilder (only Schema parser).
So, i know put a public setter on static threadlocal Schema NameValidator; so, you can use Schema.setNameValidator(Schema.NameValidator.NO_VALIDATION);
(May be a better solution would have to add a method "SchemaBuilder.builder(NameValidator v)" and then use it, but it would also imply a lot of change (in class Name, ...) for only one PR)

yeah your last commit AVRO-3704: add setter to static name validator should do the trick. So but i think we must wait until this pr is merged? What do you think when it would happen?

And you are right, a solution by injecting the validation rule into the builder direclty would be nicer, but doing this over that static method is also ok.

May i ask one other question: Why do you store that validation rule in an Threadlocal and not just in a (volatile or normal) variable? -> performance reasons?

@clesaec
Copy link
Contributor Author

clesaec commented Mar 1, 2023

Store in static to be accessible from anywhere without modifying code.
In thread local to allow several thread to work with different name validator (may be InheritableThreadLocal would be even better ?).
For volatile, as an instance of name validator has no state, there are no reason to declare it volatile.

@knotenpunkt
Copy link

are there any updates, when this comes into the master?^^

@clesaec
Copy link
Contributor Author

clesaec commented Apr 26, 2023

No, this add possibilities but this is fully compatible with current master.

@knotenpunkt
Copy link

@clesaec are you sure?

this one isnt in the master, so i cant change it or deactivate the nameValidator:

c60a5dc

@clesaec
Copy link
Contributor Author

clesaec commented May 2, 2023

I meant that as by default, the new validator behave like current validation, the merge of this PR in master won't modify current behavior.

setNameValidator is indeed a new method in this PR, allow you to update validation method.
Is that answer your question ?

@clesaec clesaec requested a review from martin-g June 12, 2023 11:33
@clesaec
Copy link
Contributor Author

clesaec commented Jun 12, 2023

@RyanSkraba : WDYT about this PR for Avro naming ?

@RyanSkraba RyanSkraba self-requested a review June 14, 2023 16:15
@RyanSkraba
Copy link
Contributor

Hey -- this is on my radar! I should have a bit more time once the release 1.11.2 candidate is (finally) done. Thanks for your patience...

@clesaec clesaec merged commit 0c862ca into apache:master Sep 12, 2023
13 checks passed
RanbirK pushed a commit to RanbirK/avro that referenced this pull request May 13, 2024
* AVRO-3704: name validator interface
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Java Pull Requests for Java binding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants