Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MSHARED-1453] Canonicalize properties files #77

Closed
wants to merge 15 commits into from
Closed

[MSHARED-1453] Canonicalize properties files #77

wants to merge 15 commits into from

Conversation

elharo
Copy link
Contributor

@elharo elharo commented Nov 26, 2024

No description provided.

@elharo elharo requested a review from gnodet November 26, 2024 16:23
pw.println(l);
for (String line : lines) {
writer.write(line);
writer.write( '\n' );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System.lineSeparator() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moot. This code can be deleted now that we're using properties.store

@@ -71,8 +71,8 @@ private void createPropertiesFile(Properties properties, Path outputFile, boolea
return;
}

try (PrintWriter pw = new PrintWriter(outputFile.toFile(), StandardCharsets.ISO_8859_1.name());
StringWriter sw = new StringWriter()) {
try ( Writer writer = Files.newBufferedWriter(outputFile, StandardCharsets.ISO_8859_1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loadPropertiesFile method let the Properties class decide the charset used (so it's using ISO_8859_1, as it's the default). I think we should use an OutputStream here and pass it to properties.store() without the charset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea. That makes this much simpler.

Copy link
Contributor Author

@elharo elharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code sorted the properties and removed comments. Ive dropped that for now. Was it necessary for reproducible builds or something?

@@ -71,8 +71,8 @@ private void createPropertiesFile(Properties properties, Path outputFile, boolea
return;
}

try (PrintWriter pw = new PrintWriter(outputFile.toFile(), StandardCharsets.ISO_8859_1.name());
StringWriter sw = new StringWriter()) {
try ( Writer writer = Files.newBufferedWriter(outputFile, StandardCharsets.ISO_8859_1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea. That makes this much simpler.

pw.println(l);
for (String line : lines) {
writer.write(line);
writer.write( '\n' );
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moot. This code can be deleted now that we're using properties.store

@gnodet
Copy link
Contributor

gnodet commented Nov 26, 2024

The old code sorted the properties and removed comments. Ive dropped that for now. Was it necessary for reproducible builds or something?

Ah, could be. Having a stable output is really important imho.

@elharo elharo marked this pull request as draft November 26, 2024 18:14
@elharo
Copy link
Contributor Author

elharo commented Nov 26, 2024

OK. I'm going to add some tests to this too. The class is sorely lacking in them.

@elharo elharo changed the title A few nits picke dup by IntelliJ A few nits picked up by IntelliJ Nov 26, 2024
pw.println(l);
}
try (OutputStream out = Files.newOutputStream(outputFile)) {
properties.store(out, null);
Copy link
Contributor

@gnodet gnodet Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can use properties.store here.
The removal of comments and the ordering is definitely important for reproducible builds.
We could refactor the code to use BufferedReader.lines(), filter out comments, sort, and print using the streams api.

for (String key : sortedPropertyNames) {
out.write(key);
out.write(": ");
out.write(unsortedProperties.getProperty(key));
Copy link
Contributor

@gnodet gnodet Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I see why the original code wrote out a properties file and read it back in. That avoided the need to reimplement the escaping.

Copy link
Contributor Author

@elharo elharo Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, that might not always work. It relies on the Properties format being consistent across VMs and Java versions and it doesn't have to be. We're accounting for comments, separator character, and ordering here, but there are other possible differences that can occur. It's risky to assume that Eclipse Temurin 21 is going to produce the same output as OpenJDK 8.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's risky. That part is very clearly specified. Changing the way properties file are stored would be a big breakage.
See https://docs.oracle.com/javase/8/docs/api/java/util/Properties.html#store-java.io.Writer-java.lang.String- for an exact explanation of what is written.
The way key / separator / value are written is clearly specified and has not changed from JDK 8 to JDK 24.

Then every entry in this Properties table is written out, one per line. For each entry the key string is written, then an ASCII =, then the associated element string. For the key, all space characters are written with a preceding \ character. For the element, leading space characters, but not embedded or trailing space characters, are written with a preceding \ character. The key and element characters #, !, =, and : are written with a preceding backslash to ensure that they are properly loaded.

Feel free to re-implement this mechanism, but I'm not really sure it's worth it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To guarantee the output, I think we are going to need to reimplement the escaping here. This could be tricky. What are the rules about copy-pasting OpenJDK code into an Apache project? We probably can't do that since it's GPL, not ASpache licensed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also:

[...] the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes as defined in section [3.3 (https://docs.oracle.com/javase/specs/jls/se24/html/jls-3.html#jls-3.3) of The Java Language Specification; only a single 'u' character is allowed in an escape sequence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To guarantee the output, I think we are going to need to reimplement the escaping here. This could be tricky. What are the rules about copy-pasting OpenJDK code into an Apache project? We probably can't do that since it's GPL, not ASpache licensed.

The output is already guaranteed by the javadoc of the Properties.store() method. I really don't see what additional guarantees you're looking for.

try (Writer out = Files.newBufferedWriter(outputFile, StandardCharsets.ISO_8859_1)) {
for (String key : sortedPropertyNames) {
out.write(key);
out.write(": ");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The separator used in properties.store() is = without spaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Colon is allowed. Per wikipedia "There are 3 delimiting characters: equal ('='), colon (':') and whitespace (' ', '\t' and '\f')." but I'll change it. Note the test case passes.

@elharo elharo changed the title A few nits picked up by IntelliJ [MSHARED-1453] Canonicalize properties files Nov 26, 2024
@gnodet
Copy link
Contributor

gnodet commented Nov 26, 2024

I think the assumption driving this issue is wrong. All JDK output the exact same file for a given output, apart from: the comment date, and the order of properties.
The output is very clearly specified in the Properties javadoc, and much more detailed than just a way to produce an output which can later be loaded.

@gnodet
Copy link
Contributor

gnodet commented Nov 26, 2024

The javadoc from JDK 1.4 is actually more concise, because it did only support ISO 8859-1 at that time:

Then every entry in this Properties table is written out, one per line. For each entry the key string is written, then an ASCII =, then the associated element string. Each character of the key and element strings is examined to see whether it should be rendered as an escape sequence. The ASCII characters , tab, form feed, newline, and carriage return are written as \, \t, \f \n, and \r, respectively. Characters less than \u0020 and characters greater than \u007E are written as \uxxxx for the appropriate hexadecimal value xxxx. For the key, all space characters are written with a preceding \ character. For the element, leading space characters, but not embedded or trailing space characters, are written with a preceding \ character. The key and element characters #, !, =, and : are written with a preceding backslash to ensure that they are properly loaded.

@gnodet
Copy link
Contributor

gnodet commented Nov 26, 2024

I'd go for something like:

    private void createPropertiesFile(Properties properties, Path outputFile)
            throws IOException {
        Path outputDir = outputFile.getParent();
        if (outputDir != null && !Files.isDirectory(outputDir)) {
            Files.createDirectories(outputDir);
        }
        StringWriter sw = new StringWriter();
        properties.store(sw, null);
        String nl = System.lineSeparator();
        String output = Stream.of(sw.toString().split("\\R"))
                .filter(line -> !line.startsWith("#"))
                .sorted()
                .collect(Collectors.joining(nl, "", nl));
        try (Writer pw = new CachingWriter(outputFile, StandardCharsets.ISO_8859_1)) {
            pw.write(output);
        }
    }

}
}
}

private static String escape(String s) {
String escaped = StringEscapeUtils.escapeJava(s);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's wrong. It's not general java escaping mechanism. The escaping is specific to the properties file. There are rules for spaces and separators =, :, and the rules are slightly different for the key and for the values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know. I'm working on the additional pieces. That's just the quickest way to handle the Unicode part.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the quickest way is to not reimplement the whole thing.
Please look at https://github.com/apache/maven-archiver/pull/79/files

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unicode encoding is clearly not well supported in the current code, but the PR I pointed above fixes it. The reason is that calling Properties.store(Writer) bypasses the encoding, while calling Properties.store(OutputStream) correctly supports unicode.

@michael-o
Copy link
Member

I do remember @hboutemy doing this already somewhere...

// Now read the file directly to check for alphabetical order and encoding
List<String> contents = Files.readAllLines(pomPropertiesFile, StandardCharsets.ISO_8859_1);
assertEquals(4, contents.size());
assertEquals("a\\ key\\ with\\\twhitespace=value\\ with\\\twhitespace", contents.get(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test is wrong afaik. The mechanism is different for keys and values

Then every entry in this Properties table is written out, one per line. For each entry the key string is written, then an ASCII =, then the associated element string. For the key, all space characters are written with a preceding \ character. For the element, leading space characters, but not embedded or trailing space characters, are written with a preceding \ character. The key and element characters #, !, =, and : are written with a preceding backslash to ensure that they are properly loaded.

if (Character.isWhitespace(c) || c == '#' || c == '!' || c == '=' || c == ':') { // backslash escape
sb.append('\\');
sb.append(c);
} else if (c < 256) { // 8859-1
Copy link
Contributor

@gnodet gnodet Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK,. this one I don't see. What's wrong here?

Copy link
Contributor

@gnodet gnodet Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following test succeeds:

        Properties p = new Properties();
        p.put("foo", "aéüb");
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        p.store(baos, null);
        String s = baos.toString();
        assertTrue(s.contains("foo=a\\u00E9\\u00FCb"));

So non plain ascii chars should be unicode encoded.

Copy link
Contributor

@gnodet gnodet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to spend more time on re-implementing something from the JDK for no benefit really...

@michael-o
Copy link
Member

I don't want to spend more time on re-implementing something from the JDK for no benefit really...

Me as well.

@elharo elharo closed this Nov 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants